Downloaded from orbit.dtu.dk on: Mar 30, 2019

Genus-level studies of gene dynamics for the Aspergillus genus

Theobald, Sebastian

Publication date: 2018

Document Version Publisher's PDF, also known as Version of record

Link back to DTU Orbit

Citation (APA): Theobald, S. (2018). Genus-level studies of gene dynamics for the Aspergillus genus. Kgs. Lyngby: Technical University of Denmark (DTU).

General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

 Users may download and print one copy of any publication from the public portal for the purpose of private study or research.  You may not further distribute the material or use it for any profit-making activity or commercial gain  You may freely distribute the URL identifying the publication in the public portal

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Genus-level studies of gene dynamics for the Aspergillus genus

PhD thesis by Sebastian Theobald

Technical University of Denmark

January 2018 2 Contents

1 Introduction 13 1.1 Fungi and their impact on society ...... 13 1.2 Non ribosomal peptide synthetases (NRPS) ...... 15 1.3 Polyketide synthases (PKS) ...... 17 1.4 Polyketide - non ribosomal peptide synthetase hybrids . . . . 19 1.5 Genome research in Aspergilli ...... 20 1.6 Approaches for Comparative Genomics in Aspergillus and Peni- cillium ...... 21

2 The genomes of Aspergillus section Nigri reveal drivers in fungal speciation 63

3 Uncovering bioactive compounds in Aspergillus section Ni- gri by genetic dereplication using gene cluster networks 81

4 Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli 93

5 Comparative genomics of A. nidulans and section Nidu- lantes 109

6 Conclusions and perspectives 121

A Appendix 123 A.1 Supplementary information: The genomes of Aspergillus sec- tion Nigri reveal drivers in fungal speciation ...... 123 A.2 Supplementary information: Uncovering bioactive compounds in Aspergillus section Nigri by genetic dereplication using sec- ondary metabolite gene cluster networks ...... 145 A.3 Supplementary information: Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli . . . 151

3 A.4 Supplementary information: Comparative genomics of A. nidu- lans and section Nidulantes ...... 156

4 Contents

Preface

This thesis serves as a partial fulfillment of the requirements to obtain a PhD degree of the Technical University of Denmark (DTU). The work was carried out at the Department of Biotechnology and Biomedicine under the supervision of Prof. Mikael R. Andersen, Assistant Prof. Tammi C. Vesth, Prof. Thomas O. Larsen and Anders G. Pedersen. The project has been funded by the Villum foundation (grant VKR023437).

Kongens Lyngby, January 2018

Sebastian Theobald

5 Acknowledgements

This project would have not been possible without the many wonderful people involved who supported me professionally and personally. First, I would like to thank my supervisors Mikael Rørdam Andersen and Tammi Camilla Vesth for giving me the opportunity to pursue this project and mentoring me during the three years. You were putting me on the right path when I was getting lost in the details of the project — which in my case happens quite frequently. I learned so much during these three years. Thank you for making this possible and also teaching me that there’s much more to science than just the project. Furthermore, I would like to thank the whole Network Engineering of Eukaryotic Cell Factories group. Especially Jane Nybo, the best office mate you can wish for, with constant positive and energetic attitude, Inge Kjærbølling for also being full of positive energy and curiosity, and Julian Brandl for his humor and supporting me with python. It was a great experience to work and travel together with you guys to so many different places. This mostly bioinformatic project tried to bridge the gap between com- parative genomics and molecular biology, which would have not been possible without the help of my colleagues. I would like to thank Jakob B. Hoof and Jakob K. Rendsvig for a great collaboration to elucidate the malformin gene cluster and Thomas Ostenfeld Larsen, who joined as co-supervisor later in the process, for his advice and mentoring regarding the chemical aspects of this thesis. Also I would like to thank Matthias Brock for the opportunity to visit his laboratory at University of Nottingham and Elena Geib for col- laboration. Here also a big thanks to Justyna Kuska! It was great to have a small masters reunion in the UK! DTU Bioengineering has been a great place to work and I want to thank everyone in building 223 and 221 for creating such a good atmosphere. I’m very grateful that I could work together with so many inspiring people during my PhD — I definitely learned a lot from everyone. Also, I would like to thank my family and friends for their constant sup- port they gave me during these past three years. Finally, a very special thanks goes to Angelaˆ de Carvalho for the love, the endless support, and for just always being there.

6 Abstract

The fungal genus Aspergillus has a great impact on society. It includes species with industrial value A. niger and A. oryzae, medically relevant species As- pergillus fumigatus and A. terreus, and the model organisms A. nidulans. A. nidulans, chosen as model organism due to its sexual cycle and pos- sibility to study genetics using spore color, provided insights into key regu- lators of cell cycle control, development and secondary metabolism. Results obtained from studies on A. nidulans were found to be applicable on other Aspergilli as well. Genome sequencing of A. nidulans, A. fumigatus and A. oryzae further revealed a large phylogenetic distance between species of the genus. Hence, conclusions derived from studies on model organisms might not apply to the genus as a whole but rather on smaller taxonomical groups. Taxonomy of Aspergilli is constantly under evaluation and the information added by new methods and technologies is aiding the definition of new taxonomic groups (Chen et al., 2016; Hubka et al., 2016). In the 300 Aspergillus genome project, we are extending the set of se- quenced species to further investigate genome and gene diversity across the entire genus. This specific project deals with the diversity of secondary metabolite (SM) genes and conservation of regulators. To achieve a com- parison of several species at once, we created methods which classify genes into families. This approach highlighted that regulators like mcrA, a master regulator of secondary metabolism, is conserved throughout Aspergilli and that galR, a regulator which was thought to be unique for A. nidulans is present in many other species. SM genes needed a slightly different type of clustering since they show high similarity between each other — making them difficult to separate. The full pathway for a secondary metabolite is encoded on a spatial collective of genes, a secondary metabolite gene cluster (SMGC). Thus, to find families of SMGC for production of similar SMs we compared whole gene clusters against each other. Our analysis highlights differences of SMGC families shared on several taxonomic levels. Characterizing regulators and SMGC over several species as families has the advantage that we are able to follow their distribution throughout the genus. This has several benefits. Examination of conserved genes can give insights into the adaptations of a section of Aspergilli. Examining SMGC families can give hints to uninvestigated SMGC producing promising drug leads. Overall, analyzing Aspergillus species with this comparative approach will reveal their gene dynamics and diversity and give insights into adapta- tions of sections throughout the genus.

7 Dansk resume

Den filamentøse svampeslægt Aspergillus har stor indflydelse p˚asamfundet. Slægten indeholder arter med stor industriel værdi (f.eks. A. niger og A. oryzae), medicinsk indflydelse (f.eks. A. fumigatus og A. terreus), og biol- ogisk model organismer (f.eks. A. niger og A. nidulans). A. nidulans blev en klassisk model organisme grundet sin seksuelle cyklus og muligheden for at studere genetik ved hjælp af sporefarve. Dette har givet indsigt i essen- tielle regulatorer af cellecykluskontrol, udvikling og sekundær metabolisme. Studier foretaget i A. nidulans har vist sig at være anvendelige i flere af arterne fra Aspergillus slægten. I 2005 blev A. nidulans, A. fumigatus og A. oryzae genom sekventeret og dette afslørede en langt større fylogenetisk afstand mellem arterne end før antaget. Derfor er det vigtigt ikke at antage at konklusioner fundet i studier af model organismer kan overføres direkte p˚aslægtensom helhed, men snarere p˚amindretaksonomiske grupper. Tax- onomien for Aspergillus slægten er konstant under evaluering, hvor informa- tion opn˚aetvia nye metoder og teknologier understøtter definitionen af nye taxonomiske grupper (Chen et al., 2016; Hubka et al., 2016). Vi vil, igennem Aspergillus hel-slægts sekventeringsprojektet, udvide sættet af sekventeret arter for yderligere at undersøge genom og gendiversiten over hele slægten. Denne afhandling omhandler diversiteten af sekundære metabolit (SM) gener og konserveringen af regulatorer. For at sammenligne flere arter p˚a´en gang, har vi skabt metoder der grupperer gener i familier af samme funktion eller produkt. Disse metoder har fremhævet at regulatorer som mcrA, en master- regulator af sekundær metabolisme, er bevaret i hele Aspergillus slægten og at galR, en regulator der blev anset for at være unik i A. nidulans er til stede i langt flere arter. SM gener bruger en speciel type gruppering, da de viser høj genetisk similaritet mellem hinanden — hvilket gør dem vanske- lige at adskille. Den fulde pathway af en sekundær metabolit er kodet i et spatial kollektiv af gener, et sekundært metabolit genkluster (SMGC). For at finde familier af SMGC til produktion af tæt beslægtede SM’er har vi sammenlignet hele genkluster mod hinanden. Vores analyser viser forskelle og ligheder mellem SMGC-familier, der deles p˚afleretaksonomiske niveauer. Karakterisering og gruppering af regulatorer og SMGC over flere arter har det fortrin, at vi er i stand til at identificerer deres tilstedeværelse og følge deres fordeling over hele slægten. Dette har flere fordele. Undersøgelser af konserveret gener kan give indsigt i evolution og tilpasning af sektioner i As- pergillus slægten. Undersøgelser af SMGC-familier kan lede til nye SMGC der potentielt producerer lovende lægemidler. Samlet set vil analyser af Aspergillus arter med denne sammenlignings metode afslører arternes gen- dynamik og diversitet, samt give indsigt i tilpasninger af sektioner i hele

8 slægten.

9 Publications

Book chapter included in this thesis • Approaches for Comparative Genomics in Aspergillus and Penicillium Nybo, J.L.*, Theobald, S.*, Brandl, J., Vesth, T.C. Andersen, M.R., Aspergillus and Penicillium in the Post-genomic Era. de Vries, R.P., Benoit Gelber, I. & Andersen, M.R. eds., June 2016, Caister Academic Press, Ch. 4.

Submitted manuscripts • The genomes of Aspergillus section Nigri reveal drivers in fungal spe- ciation Vesth, T.C., Nybo, J.L., Theobald S., Frisvad, J.C., Larsen, T.O., Nielsen, K.F., Hoof, J.B.,Brandl, J., Salamov, A., Riley, R., Nielsen, M.T., Lyhne, E. K., Kogle, M. E., Strasser, K.,McDonnell E.,Barry, K., Clum, A.,Chen C., Nolan, M., Sandor, L., Kuo, A., Lipzen, A., Hain- aut, M., Drula, E., Tsang, A., Henrissat B., Wiebenga, A., M¨akel¨a, M.R., de Vries R.P., Grigoriev, I.V., Mortensen, U.H., Baker, S.E.*, Andersen, M.R.*

• Uncovering bioactive compounds in Aspergillus section Nigri by genetic dereplication using secondary metabolite gene cluster networks Theobald, S., Vesth, T., Rendsvig, J.K., Nielsen, K.F.,Riley, R., de Abreu, L.M., Asaf Salamov, Jens Christian Frisvad, Thomas Ostenfeld Larsen, Mikael Rørdam Andersen*, Jakob Blæsbjerg Hoof*

Manuscripts in preparation • Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli Theobald, S., Vesth, T.C., Andersen, M.R.

• Comparative genomics of Aspergillus nidulans and section Nidulantes Theobald, S., Vesth, T.C., Nybo, J.L., Kjærbølling, I.,Riley, R., Salamov, A., Lyhne, E. K., Kogle, M.E., Frisvad, J.C., Hoof, J.B., Mortensen U.H.,Dyer, P., Momany, M., Larsen, T., O., Baker, S., Andersen, M.R.

Manuscripts in preparation and not included in this thesis • Comparative genomics and investigation of melanin biosynthesis path- ways in Aspergillus section Terrei

10 Theobald, S., Vesth, T.C., Nybo, J.L., Geib E., Kjærbølling, I.,Riley, R., Salamov, A., Lyhne, E. K., Kogle, M.E., Frisvad, J.C., Hoof, J.B., Brock, M., Mortensen U.H., Larsen, T. O., , Baker, S., Andersen, M.R.

• Comparative genomics in Aspergillus sections Usti and Cavernicolus Nybo, J.L., Vesth T.C, Theobald, S., Frisvad J.C, Larsen T.O., Kjærbølling I., Lyhne, E.K., Kogle M.E., Phippen C., Barry K., Clum A., Chen C., Nolan M., Sandor L., Lipzen, A., Kuo A., Salamov A.A., Drula E., Ri- ley R., Henrissat B., Hainaut M., Wiebenga A., M¨akel¨aM.R., de Vries R.P., Grigoriev I.V., Baker S.E.*, Andersen M.R.*

11 Thesis aim and structure

The aim of this thesis is to characterize gene dynamics on a genus level. Furthermore, a new method to characterize secondary metabolite genes had to be established. This thesis describes a new method to characterize spatial collectives of genes for production of secondary metabolites (SMs): secondary metabolite gene clusters (SMGCs), over several genomes. Aspergillus section Nigri and Nidulantes are used as case studies to:

1. Classify SMGC into SMGC families producing similar SMs

2. Characterize the SMGC diversity on different taxonomic levels.

3. Provide gene leads for elucidation by combining presence of SMGC families with metabolite production data.

4. Characterize SMGC families using a guilt by association approach.

5. Show the phylogenetic history of one SM class

6. Identify distributions of regulators throughout the genus

First, the introduction provides an overview of the importance of fungal biology, secondary metabolites, Aspergillus genome research and approaches for comparative genomics in Aspergillus and Penicillium. Chapter 2 includes the developed method for SMGC family creation and addresses points 1,2 and 3. Chapter 3 builds up on the preceeding chapter, extends the analysis performed under point 2 and adds point 4. Chapter 4 extends the dataset by more species outside section Nigri and shows the evolution of a hybrid SM class. Chapter 5 illustrates the general applicability of the method by using section Nidulantes. Points 1-4 and 6 are addressed here.

12 Chapter 1

Introduction

1.1 Fungi and their impact on society

The genus Aspergillus, a taxon of filamentous fungi within the Ascomycetes, is of vast importance in different areas of society. In nature, they act mostly as plant degraders and to cope with the complex plant cell wall components, they secrete a diverse repertoire of hemicellulases, pectinases and cellulases. These enzymes are useful in industrial applications and many of these are pro- duced by the important cell factory Aspergillus niger CBS 513.88 (de Vries et al., 2001). In food-related applications, e.g. juice production, the enzymes degrade cell wall components that would usually make the food processing inefficient. Fungal enzymes are used to ease the process and maximize yield at low cost. Furthermore, fungal xylanases and endoglucanases are used to separate gluten and starch from flour (Moore et al., 2011), thus creating the desired product. Another isolate of A. niger, ATCC 1015, is the parent strain of the strains used in industry in the production of e.g. citric acid (E330), a food preservative and flavor enhancer, and gluconic acid, a food additive and acidity regulator. Another fungus, A. oryzae, is heavily used in the food industry for the production of e.g. soy sauce and miso. Apart from their industrially useful enzymes, Aspergilli also produce secondary metabolites (SMs) which can be both problematic and useful compounds. A. flavus contaminates stored food stuffs with strong car- cinogen aflatoxin (Klich, 2007). Aflatoxin is then ingested by consumers where it causes liver cancer and suppresses the immune system. It can also be transferred from mothers to newborns, creating an even greater health threat(Rafiei et al., 2014). Aspergillus species of section Nigri affect food stocks like e.g. cashew nuts where contaminating Nigri species produce fu- monisins, ochratoxin A, and secalonic acid (Lamboni et al., 2016). These

13 contaminations can affect the economy. For instance, cashew nuts are an im- portant export product of west African countries, which in the case of Benin made up 8% of export revenues in 2011 (Lamboni et al., 2016). SMs are also found throughout other Ascomycete species than Aspergilli. Cochliobolus heterostrophus (or: Bipolaris maydis) is another Ascomycete from the class of Pleosporaceae and causes corn leaf blight in maize. Yang (1996) identified a protein responsible for the necrotrophic effector T-toxin, a SM, which affects T-cytoplasm corn — a male sterile variant corn. Mag- naporthe grisea, a Sordariamycete produces a SM, through the protein Ace1, that is an avirulence factor triggering resistance in rice (Oryza sativa) plants which carry the resistance gene Pi33 (B¨ohnertet al., 2004). Ace1 is only expressed during plant infection; any disturbance of the process abolishes Ace1 expression (B¨ohnertet al., 2004). Despite the toxic and virulence promoting SMs, many can be used as pharmaceuticals, e.g. to treat bacterial and fungal infections. Penicillium chrysogenum (now identified as P. rubens) for example, has been long used for the production of the antibiotic penicillin — which has also been dis- covered in A. nidulans (van Liempt et al., 1989). Another SM isolated from fungi, lovastatin (Rubinstein et al., 1991), has been widely used as a cholesterol-lowering drug and might have the same impact on society as penicillins. Other compounds are investigated for their pharmaceutical prop- erties like e.g. cytochalasins (Petersen et al., 2014b) and malformins (Hagi- mori et al., 2007b). Hence, it is of interest to investigate the enzymatic and genetic basis for SMs. SMs are synthesized by different classes of enzymes. In fungi, these are polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), en- zymes only consisting of a smaller subset of modules (PKS-Likes, NRPS- Likes), fusions of PKS and NRPS (PKS-NRPS/ NRPS-PKS hybrids) terpene cyclases (TC), and dimethylallyl transferases (DMATS). These enzymes pro- duce a molecule that is further modified by tailoring enzymes which e.g. add additional groups or cyclize the initial molecule. The enzymes necessary for production of a SM are encoded by a collective of genes — a secondary metabolite gene cluster (SMGC). This thesis concentrates mostly on PKS, NRPS and derivates of these SM genes. Sequencing of fungal genomes facil- itated to reveal the genetic basis for many SMGC. With the first Aspergillus species sequenced in 2005 (Galagan et al., 2005), some SMGC could be elu- cidated. The genetic basis for the majority of SMs, however, still remained hidden since genome mining for SMGCs is laborious. Additionally, the evo- lution of SMGC, leading to analogous compounds, was impossible to uncover because only few Aspergilli were sequenced.

14 1.2 Non ribosomal peptide synthetases (NRPS)

Non-ribosomal peptides (NRPs) constitute a major group of secondary metabo- lites, like e.g. the anticancer compound nidulanin (Klitgaard et al., 2015), anticancer drug enhancing malformins (Hagimori et al., 2007a), the iron- transporting siderophores (Eisendle et al., 2003), antifungal echinocandins (Bills et al., 2016), and antimicrobial penicillins (Wright, 1999). The latter was first discovered in P. chrysogenum (Diez et al., 1990), and shortly after, under certain fermentation conditions, in A. nidulans (Holt and MacDonald, 1968; MacCabe et al., 1991). This wide array of bioactive NRPs is synthesized by non-ribosomal pep- tide synthetases NRPSs. These enzymes can — as opposed to ribosomal pep- tide synthesis — utilize L-amino acids, as well as non-proteogenic amino acids as their substrate (Finking and Marahiel, 2004), hence creating compound diversity. They consist of one or more modules, which typically contain three domains (Weissman, 2015; Marahiel, 2016); an adenylation (A) domain for loading of amino acids, a peptidyl carrier protein (PCP) domain, also called thiolation (T) domain for peptide chain transfer, and a condensation (C) domain for peptide bond formation. Other domains like the epimerisation (E) domain change the chirality of their proximate amino acid. A-domains choose the amino acid according to the Stachelhaus-code — a sequence in- side the A-domain in their active site (Stachelhaus et al., 1999). First, amino acids are recognized and activated by the A-domain, and react with ATP to aminoacyl-adenosinmonophasphate (aminoacyl-AMP) which can then form a aminoacyl thioester with the panthetheine group of the T-domain (a pro- cess analogous to activation of amino acid for ribosomal peptide synthesis) (Evans, 2016). Condensation domains show different subtypes that are de- pendent on the stereoisomer of the preceding amino acid: starter C domain as initial C domain, CLL domain for joining two L-amino acid, CDL domains for joining of D and L configured amino acids — selecting the correct enan- tiomer (Clugston et al., 2003)— and heterocyclizing C domains (Finking and Marahiel, 2004). Epimerising C domains (E) are a special subtype. These domains switch the stereochemistry of an amino acid from L to D. Dual epimerisation-condensation domains show both functions. Surprisingly, ex- amples of NRPS following different synthesis schemes are increasing. Enni- atin, enterobactin, bacillibactin and gramicidin S are synthesized by iterative NRPS, meaning that their modules are reused (Mootz et al., 2002). Penicillin is produced from joining L-aminoadipic acid — a product of L- lysine modification — with L-cysteine and L-valine (Fig. 1.1). Each amino acid is recruited by the adenylation domain and loaded onto the T domain. L-aminoadipic acid and cystein are condensed and subsequently, L-valine is

15 isomerized by the E-domain to its D form and the C-domain condenses D- valine with the dipeptide. The thioesterase domain releases the tripeptide which is further processed by tailoring enzymes to form penicillin. In general, the release of peptides is either by hydrolisis, leading to a linear peptide, or lactamization, leading to a cyclic peptide (e.g. nidulanin (Klitgaard et al., 2015)).

Malformin, previously considered as antibiotic against bacteria (Suda and Curtis, 1966), has been shown to be a cytotoxin which arrests cells in G2 and M phase and subsequently induces cell death (Wang et al., 2015a). Cancer cells only repair DNA damage in G2 stage (Hartwell and Kastan, 1994), hence this is a potential anticancer drug enhancer. Malformin is a pen- tapeptide of Val, D-Leu, Ile and two D-Cys which contains a disuflide bridge between the two cysteins. Its NRPS has not been elucidated until now (see chapter 3). Nidulanin A, another NRP, is a cyclic tetrapeptide consisting of L-phenylalanine (Phe), L-valine (Val) and D-Val, one L-kynurenine residue (a metabolite of L-Trp) and isoprene (Klitgaard et al., 2015). These com- pounds emphasize that NRPS in fungi show a wide variety of structures, incorporating various monomers and epimerizing amino acids. The applica- bility of non ribosomal peptides is shown by the function of penicillin and malformin. The investigation of secondary metabolism in Aspergillus can benefit society with candidates for pharmaceutical compounds.

Bioinformatic work has greatly assisted in the investigation of NRPS do- mains and diversity. Known examples of NRPS have been used to create prediction tools for adenylation domain specificity, like e.g. NRPSpredic- tor (Rausch et al., 2005). Due to their conservation, these domains have also been the primary subject for phylogenetic and phylogenomic studies. Phylogenomic studies identified nine NRPS subfamilies throughout 38 fun- gal genomes (Bushley and Turgeon, 2010). Several subgroups are suspected to be of bacterial origin, emphasizing that horizontal gene transfer (HGT) from bacteria to fungi is common (Bushley and Turgeon, 2010). Another key finding was that a group of multimodular NRPS is restricted to fungi and extensive gain and loss of domains drives NRPS evolution. Addition- ally, NRPSs drive chemical innovation through point mutations, rearrange- ments, duplication and deletion of domains/modules, creating an even higher chemical diversity through incorporation of new substrates (Fischbach et al., 2008). Other domains which have been investigated are the C-domains. Phy- logenetic analysis of C-domains revealed a common origin of several classes (Rausch et al., 2007).

16 A T C A T C A T E TE

S SO SO 1 O CH3 2NH 2NH

SH CH3

2NH OH L-Cys L-Val O

L-α-aminoadipic acid

A T C A T C A T E TE

O S SO 2 CH SH NH 3 2NH

O CH3

2NH O OH

A T C A T C A T E TE

3 SO O

O NH CH3 NH

CH3 SH

2NH

OH O

OHO O NH H S O NH CH3 NH CH 4 O 3 CH3 CH SH N 3 O 2NH OH O OH O

Figure 1.1: Synthesis of δ-L-α-aminoadipyl-L-cysteinyl-D-valine, a penicillin precursor, by an NRPS synthetase. 1: L-cys-thioester condenses with L-α- aminoadipyl-thioester to form a dipeptide. 2 L-Val is epimerized to D-Val and condenses with the dipeptide. 3 δ-L-α-aminoadipyl-L-cysteinyl-D-valine (ACV) is released. 4 tailoring enzymes process ACV to penicillin G

1.3 Polyketide synthases (PKS)

Polyketides, the most abundant secondary metabolites in Aspergillus species, are a class of diverse compounds with a wide range of bioactivities. These compounds include mycotoxins such as sterigmatocystin (Purchase and Van der Watt, 1973) and the highly carcinogenic aflatoxin (Coulombe and Sharma, 1985), while others are insecticides as in the case of dehydroaustinal (Valiante et al., 2017). They can also have medical properties like the cholesterol- lowering drug lovastatin (Rubinstein et al., 1991). Polylketide synthases consist of several domains to synthesize these com- pounds. In the first step of polyketide synthesis, the acyl transferase (AT) domain loads acetyl-coenzyme A (acetyl-CoA) on the pantetheine arm of

17 CH SEnz acetyl-Coa 3 + malonyl-Coa OOO KS AT DH MT KR ACP

reduction of carbonyl to alcohol group

OH CH3 CH3 O -H2O O SEnz SEnz

O O

addition of malonyl-CoA

CH3 CH3 CH3 aldol enoylization O reaction SEnz hydrolysis SEnz OEnz -H2O

O O O O OH O

Figure 1.2: Synthesis of 6-methylsalicilic acid and structure of PKS. The 6-msa synthetase is a good example of a minimal fungal iterative type I PKS and illustrates the functions of reducing domains. Adapted from (Dewick, 2009) the ACP domain which then transfers acetyl-CoA to the β-ketoacylsynthase (KS) domain. The extender unit malonyl-coenzyme A (malonyl-CoA) is loaded onto the ACP domain and the KS domain condenses acetyl-CoA and malonyl-CoA to yield the growing polyketide chain. Subsequently, ketore- ductase (KR), dehydratase (DH), enoylreductase (ER) and methyltransferase (MT) domains can further reduce the polyketide. Depending on their β-keto processing activity the final polyketide chain can be highly reduced (HR), partially reduced (PR) resulting in macrolides, and non-reduced (NR), which will results in aromatics — classifying the PKS. The KR domain reduces the β-ketoester to a hydroxyester, DH domain dehydrates to an conjugated es- ter while the enoyl reductase will further reduce it to a reduced ester. The extending and processing steps are repeated until the polyketide has reached its final length, is loaded on the thioesterase (TE) domain, and released. The 6-methylsalicyllic acid (6-MSA) synthase is a good example of a minimal fungal iterative type I PKS (Fig. 1.2). Acetyl-CoA and two malonyl- CoA are condensed to a poly-β ketoester, which is subsequently reduced and dehydrated to form a double bond. After condensation of another malonyl- CoA an aldol and enolization reaction further modify the ketoester until it is hydrolized and released from the enzyme as 6-MSA. The 6-MSA synthetase is a proposed progenitor of other SMs, like e.g. patulin and yanuthone D, especially in the genus Aspergillus (Frisvad and Larsen, 2015).

18 Phylogenomic analysis of KS domains indicated eight groups of fungal PKSs (Kroken et al., 2003). One ancestral PKS is orsellinic acid synthase (orsA), which has undergone a vast duplication event in Pezizomycotina (Koczyk et al., 2015). This event is responsible for the amount of non- reducing PKSs we can identify in fungi. From the orsillinic acid synthase, meroterpenoid synthases, naphthopyrone synthases and aflatoxin synthases (among others) developed, shaping the chemical diversity we see in fungi.

1.4 Polyketide - non ribosomal peptide syn- thetase hybrids

The enzymes classes producing polyketide and non-ribosomal peptides — PKSs and NRPSs — can also be fused into a PKS-NRPS or NRPS-PKS hybrid enzymes, creating a chimeric compound. Most identified fungal hy- brids have a PKS-NRPS orientation, the PKS part being highly reducing (Boettger and Hertweck, 2013). There are however, a few fungal NRPS- PKS, e.g. tenuazonic acid (Yun et al., 2015). The products of hybrids are e.g. the mycotoxin cyclopiazonic acid (Sorenson et al., 1984), the antioxidant pyranonigrin (Miyake et al., 2007; Awakawa et al., 2013) and the potential anti-cancer compound cytochalasin (Petersen et al., 2014a). For cytochalasin biosynthesis in Aspergillus clavatus, an octaketide is formed by condensation of acetyl-CoA and 8 malonyl-CoA units and, af- ter three methylations, processed by Diels-Alder cyclization (Boettger and Hertweck, 2013; Qiao et al., 2011). Subsequently, the octaketide is joined with phenylalanine to create an amide. After release, it is further modified by tailoring enzymes. The hybrid gene, ccsA, for cytochalasin biosynthesis has an interesting evolutionary history. In a study by Khaldi and Wolfe (2011), they showed evidence that ccsA is actually a syn2 homolog which was horizontally trans- ferred from Magnaporthe grisea to A. clavatus. syn2 is a duplication of ace1 we covered in section 1.1. In general, this hybrid seemed to undergo mas- sive duplications but also deletions, since it was, at the time, not found in other Aspergillus species (Khaldi and Wolfe, 2011). Other studies suggest an interkingdom transfer of a NRPS-PKS hybrid (Lawrence et al., 2011). We further investigate the role of the hybrid SM class, also regarding the ccsA hybrid, in chapter 3.

19 1.5 Genome research in Aspergilli

The first genome sequences of A. nidulans (Galagan et al., 2005), A. fumi- gatus (Nierman et al., 2005), and A. oryzae (Machida et al., 2005) provided insights into the phylogeny, evolution and diversity of these fungi. Predicted orthologs between A. nidulans and A. fumigatus shared 66% amino acid identity (Galagan et al., 2005) — which is comparable to the protein iden- tity shared between mammals and fish (Dujon et al., 2004). More Aspergillus species were sequenced later; among them A. niger (Pel et al., 2007) which provided insights into primary metabolism as well as secondary metabolism genes. The sequencing and comparison of A. niger ATCC 1015 to the se- quenced CBS 513.88 was the first study to explore differences and compara- tive genomics between two closely related species of section Nigri (Andersen et al., 2011). The study highlighted the great differences between isolates with a chromosomal arm inversion, high mutation rates, and differences in SMGC content identified in A. niger ATCC 1015. The availability of Aspergillus genome sequences also increased the amount of SMs that could be linked to their SMGCs. Since analogous SMs, com- pounds of slightly different structure, are detected throughout the genus (Frisvad and Larsen, 2015), studies were searching for similar SMGCs. The basic local alignment search tool (BLAST) (Camacho et al., 2009) can be used individually on certain species to find candidate clusters. Maximum likelihood phylogenies are an alternative to relate secondary metabolite genes and explain their phylogeny. They are limited however to conserved domains rather than the whole cluster to estimate similarity between SM enzymes. Multiple domains per enzyme make it difficult to handle in data post pro- cessing and data trimming as well as the used algorithm affect the outcome of phylogenetic analysis (Zhou et al., 2017). Additionally, different types of SMs have to be treated with individual analyses and with increasing amount of sequences getting reliable branches through bootstrapping can be a prob- lem. Projects like the 1000 fungal genomes project, the Fusarium sequencing projects, or our Aspergillus sequencing project (extending to over 300 species) require a genome mining method that highlights SMGC dynamics in several species at once. Instead of creating individual comparisons of SMGCs on species to species basis or phylogenetic trees covering each class of SM gene, we created a tool for categorization of SMGCs into SMGC families — a group of SMGCs pro- ducing similar compounds. To achieve this, we created all against all protein BLAST which extracts bidirectional best hits related to SMGC proteins. The scores of BLAST hits are then aggregated to create a SMGC to SMGC simi- larity network. Random walks (Pons and Latapy, 2005) are used on the net-

20 work to yield families of SMGCs. We show how SMGC families can be used to aid in the characterization of SMGCs — replacing laborious blast com- parisons and synteny analyses and providing strong leads for linking genes to metabolites. As a case study, we chose the SM rich genus Aspergillus to show the applicability of our method (see chapter 2,3, and 5). Furthermore, we show how our method can highlight SMGC diversity on different taxonomic levels.

1.6 Approaches for Comparative Genomics in Aspergillus and Penicillium

21 Approaches for Comparative Genomics in Aspergillus and 4 Penicillium

Jane L. Nybo,* Sebastian Theobald,* Julian Brandl, Tammi C. Vesth and Mikael R. Andersen

Abstract Te number of available genomes in the closely related fungal genera Aspergillus and Penicil- lium is rapidly increasing. At the time of writing, the genomes of 62 species are available, and an even higher number is being prepared. Fungal comparative genomics is thus becoming steadily more powerful and applicable for many types of studies. In this chapter, we provide an overview of the state-of-the-art of comparative genom- ics in these fungi, along with recommended methods. Te chapter describes databases for fungal comparative genomics. Based on experience, we suggest strategies for multiple types of comparative genomics, ranging from analysis of single genes, over gene clusters and CaZymes to genome-scale comparative genomics. Furthermore, we have examined published comparative genomics papers to summarize the preferred bioinformatic methods and parameters for a given type of analysis, highly useful for new fungal geneticists. Moreo- ver, the chapter contains a detailed overview of comparative genomics studies of key fungal traits such as primary metabolism, secondary metabolism, and secretome analysis. Finally, we gaze into a possible future of the feld by comparing the current state of fungal comparative genomics to the development in bacterial genomics, where the comparison of hundreds of genomes has been performed for a while.

Introduction Te advent of the genomic era for the Aspergillus/Penicillium group started in 2005 with a set of three publications in Nature covering the most studied organisms in the genus, Aspergillus nidulans, A. oryzae and A. fumigatus (Galagan et al., 2005; Machida et al., 2005; Nierman et al., 2005). Tese analyses gave us the frst genome-scale overview of the three individual species, but notably, a large part of all three manuscripts was dedicated to the comparison of the genomes to each other or other fungal genomes (e.g. Saccharomyces cerevisiae). It was already clear from these frst studies that a large part of the understanding of the individual species and genetic features comes from comparative genomics in addition to the initial genome analysis. Te comparative analysis gave new information about a variety of features such as mating factor variations; primary and secondary metabolism; adaptation to difer- ent environments; genome dynamics; conserved non-coding sequence; and pathogenicity. Similarly, the later publication of the genome sequence for Penicillium chrysogenum (van den

*These authors contributed equally. 22 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 44 | Nybo et al.

Berg et al., 2008) based the analyses of phylogeny and chromosome dynamics on compari- sons with the previously published aspergilli. Since then, the interest in comparative genomics has only increased. A notable appli- cation is in the strength of comparative genomics to infer annotations from well-studied model organisms, such as the ones mentioned above, to less studied species, which are being sequenced as part of current large genome sequencing eforts in both Penicillia and Asper- gilli (see ‘Current status of genomics’, below). Furthermore, genome mining of fungi using tailored algorithms for comparative genomics has long been interesting due to the wealth of natural products found in these species (Bok et al., 2006). In this chapter, we defne comparative genomics as the comparison of genetic features for large parts or whole genomes and will focus on applications and studies of this type. Te compari- son of single genes across organisms is clearly a highly valuable efect of the availability of genome sequences, but will not be addressed here. It is clear from the examples presented above, and what will be shown in the following sections, that there is a high analytical power in comparing features across genomes. In this context, we wish to provide overviews of the following:

• the current status of fungal genomics in Aspergillus and Penicillium; • the available tools and databases for fungal comparative genomics, both general and specialized applications; • results and potentials of comparative genomics studies within the feld.

Furthermore, we will present the main application areas of comparative genomics, as well as the current best practice for comparative genomics within the individual research areas. Our goal is to enable the reader to identify the current best practices for fungal com- parative genomics based on an overview of algorithms and preferred parameters in the feld.

Current status of genomics Te foundation of comparative genomics is the availability of relevant genome sequences. Currently (May 2015), 62 genome sequences from the Aspergillus (32 species) and Peni- cillium (30 species) genera are available from public repositories (Table 4.1). While the majority of the sequences are from individual species, an increasing amount of sequences are addressing multiple strains of the same species. In those cases, the primary application of the genome has been comparative genomics with the objective to identify strain-specifc traits. Examples include high pencillin production in P. chrysogenum NCPC10086 (Wang et al., 2014), studies of pathogenicity and secondary metabolism in A. fumigatus Af293 versus A1163 (Fedorova et al., 2008; Sanchez et al., 2012), and enzyme-producing phenotypes in A. niger CBS 513.88 (Andersen et al., 2011). Details regarding these cases, as well as cross- species comparative genomics will be discussed in the following sections. It is quite clear given the current number of multiple independent genome sequencing eforts, as well as the JGI Community Sequencing Programs, that the number of available Aspergillus and Penicillium genome sequences will increase rapidly in the near future. Cur- rent ongoing eforts include both re-sequencing of multiple isolates of the same species (for instance clinical isolates, basic research as for the S. cerevisiae 100 genomes project (Strope et al., 2015), or results of industrial strain improvement programs) as well as a large number 23 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 45

Table 4.1 List of Aspergillus and Penicillium species and strains with publicly available genome sequences Species Strain Reference A. acidus CBS 106.47 1 A. aculeatus ATCC16872 A. brasiliensis CBS 101740 1 A. campestris CBS 348.81/IBT28561 3 A. carbonarius ITEM 5010 A. clavatus NRRL1 Fedorova et al., 2008 A. favus NRRL3357 Nierman et al., 2015a A. fscherianus/N. fscheri Fedorova et al., 2008 A. fumigatus A1163 Fedorova et al., 2008 A. fumigatus Af293 Nierman et al., 2005 A. glaucus CBS 516.65 1 A. kawachii IFO 4308 Futagami et al., 2011 A. nidulans FGSC A4 Galagan et al., 2005 A. niger ATCC 1015 Andersen et al., 2011 A. niger CBS 513.88 Pel et al., 2007 A. niger NRRL3 A. niger van Tieghem ATCC 13496 A. novofumigatus CBS117520/IBT16806 3 A. ochraceoroseus CBS 550.77/IBT245754 3 A. oryzae RIB40 Machida et al., 2005 A. parasiticus SU-1 Linz et al., 2014 A. phoenicis ATCC 13157 A. rambelli SRRC1468 A. ruber/Eurotium rubrum CBS 135680 Kis-Papo et al., 2014 A. steynii CBS112812/IBT 23906 3 A. sydowii CBS 593.65 1 A. terreus NIH 2624 A. tubingensis CBS 134.48 1 A. ustus 3.3904 Pi et al., 2015 A. versicolor CBS 583.65 1 A. wentii DTO 134E9 1 A. zonatus 1 P. aethiopicum IBT 5753 P. aurantiogriseum NRRL 62431 Yang et al., 2014 P. bilaiae ATCC 10455 2 P. brevicompactum AgRF18 2 P. camemberti FM 013 Cheeseman et al., 2014 P. canescens ATCC 10419 2 P. capsulatum ATCC 48735 24 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 46 | Nybo et al.

Table 4.1 Continued Species Strain Reference

P. chrysogenum NCPC10086 Wang et al., 2014 P. digitatum Pd1 Marcet-Houben et al., 2012 P. digitatum PHI26 Marcet-Houben et al., 2012 P. expansum ATCC 24692 2 P. expansum R19 Yu et al., 2014 P. expansum Li et al., 2015 P. expansum PEXP Ballester et al., 2015 P. expansum PEX1 Ballester et al., 2015 P. expansum PEX2 Ballester et al., 2015 P. fellatanum ATCC 48694 2 P. glabrum DAOM 239074 2 P. italicum Li et al., 2015 P. italicum PITC Ballester et al., 2015 P. janthinellum ATCC 10455 2 P. lanosocoeruleum ATCC 48919 2 P. marnefei ATCC18224 Nierman et al., 2015b P. nordicum BFE487 P. oxalicium (decumbens) 114-2/CGMCC 5302 Liu et al., 2013a P. paxilli ATCC 26601 Berry et al., 2015 P. raistrickii ATCC 10490 2 P. roqueforti FM164 Cheeseman et al., 2014 P. rubens4 Wisconsin 54-1255 van den Berg et al., 2008 P. stipitatum/Talaromyces stipitatus ATCC 10500 Nierman et al., 2015b

1 Genomes sequenced as a part of a JGI community sequencing proposal (CSP) led by Ronald P. de Vries (CBS-KNAW, NL). Publication pending. 2Genomes sequenced as a part of a JGI CSP led by Dave Greenshields (Novozymes). 3Genomes sequenced by JGI as a part of a JBEI/DTU sequencing proposal. 4Species previously thought to be P. chrysogenum.

of de novo sequencing projects. Currently, it seems likely that the number of Penicillium and Aspergillus genome sequences combined will exceed 100 in 2016. Such a development renders traditional storage and analysis methods inefcient and increases the need for databases, which can make these sequences available to the research community, as well as provide tools for comparing the wealth of genomes.

Available genome databases with comparative genomics capabilities Te fungal research community has been fortunate in the availability of several data reposito- ries dedicated to analysis of fungal genomes or specifc features thereof. Te original source of inspiration for many of these eforts has been the immensely successful Saccharomyces 25 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 47

Genome Database (SGD) (Cherry et al., 2012), which formed the basis for the framework behind the Aspergillus Genome Database (AspGD) (Arnaud et al., 2010; Cerqueira et al., 2014). Since the launch of AspGD, several new databases and repositories have been made available, each adding specifc analysis features and/or additional genomes relevant for comparisons. Table 4.2 gives an overview of fungal and general databases currently available which give access to fungal genomes and ofer various types of comparative genomics tools. Table 4.2 denotes which databases ofer which types of services, but of course all tools are not identical even with similar scopes, and have diferent strengths. All databases further- more have individual design philosophies and are tailored towards diferent communities. For the following types of comparative genomics studies based on information available in databases, we recommend the following approaches:

• Investigation of orthologues of single genes. An efcient approach would be to look up the gene in FungiDB, AspGD or the JGI webpages, if the gene/species of interest is present in any of these, and from there navigate to the gene orthology feature of each webpage. Alternatively, one can employ the BLAST interface present at many of the sites (Cama- cho et al., 2009). To get a comprehensive overview including as many organisms as possible, use both the JGI portal or NCBI GenBank as these contain genomes not found elsewhere. • Analysis of gene clusters across organisms. Currently, the simplest approach would be to navigate to the JGI secondary metabolism feature, identify the gene cluster of interest, and from there navigate to the JGI genome browser. Here, one can examine gene conser- vation in a few related species. For a more sophisticated analysis, where the gene cluster is known, one can use FungiDB or AspGD. Both databases ofer detailed synteny mapping based on the SYBIL sofware (Crabtree et al., 2007) to a larger number of organisms. Tis allows the user to examine whether gene order has been rearranged. • Analysis of CAZymes. For both Aspergillus and Penicillium species, this is a highly inter- esting analysis of potential biomass-degrading enzymes. Such data are currently best examined by downloading it from the CAZy-database (Cantarel et al., 2009), if it is available for the organisms of choice, and performing an analysis ofine using custom comparative tools. Alternatively, the staf behind CAZy are involved in many community eforts of this type. • Untargeted analysis. In some cases, it is of interest not to look at defned/known genes, but instead compare genes of a specifc type, structure or function across organisms. For this type of analysis, FungiDB/EupathDB have highly sophisticated gene search methods, based on DNA and protein motifs, domains, signal peptides, association with specifc metabolites, etc. • Genome-scale comparative genomics. At the moment, this type of analysis is not available through any of the platforms. Tis is currently best undertaken by downloading genome data of the organisms of choice and performing these computationally heavy tasks on dedicated systems. • Download of genomics data sets. Depending on the genomes of interest, NCBI Genomes or JGI are the best sources of data. Some genomes are only found in one database but not the other, so for a comprehensive analysis, both databases must be queried. In our hands, we fnd that for most of our applications, the data format downloadable from JGI is the most fexible, as the sequence data is preformated in several diferent ways. Furthermore, 26 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P

Data ● ● ● ● ● ● ● ●

Bulk Download

● Metabolic Pathway Analysis

● ● ● ●

Cross- genome searches for annotation

Locus

● ● ● ● ● Comp. Gene information

1 1 ● ● ● ● ● other

Comparison to omics tools

● ● ● ● ● Comp. Genome Browser genomics

genomes

Comparative Visual comparison of

● ● genomics

genomes.

Secondary Metabolite Cluster comparison many

on

● ● ● ● ● comparative

track

Orthologue identifcation

fungal

browser

● ● for

Custom analysis fows

genome

a

● ● ● as

resources

Synteny analysis ● ● ● ● ● ● integrated BLAST

, , are , .

. often-used ,

, . .

. , , al al . al .

al al

of

al

al

et et data et

et et

et et

Cerqueira 2014 2013

, , . . al al

Grigoriev 2014 Wheeler 2013 Reference(s) Arnaud 2010; et Gilsenan 2012 Cantarel 2009 Aurrecoechea et Stajich 2012 Grigoriev 2012 Overview

4.2

EST/SRN-sequencing MycoCosm NCBI Table Database 1 AspGD CADRE CAZy EuPathDB FungiDB JGI genome portal 27 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 49

the JGI has a specialized interface (GLOBUS) for downloading large datasets, which is very useful for studies involving the comparison of more than a few species.

For larger or more advanced, or specialized studies, it is typically advantageous to use some of the many specialized algorithms and methods developed. Te next section gives an overview of these.

General methods for comparative genomics Methods in comparative genomics are in general based on tools used for single genome analysis, and in many cases need to be tailored accordingly to be able to manage cross- species diferences. Comparative genomics has been used to shed light on everything from identifying species-specifc genes, transcription factor (TF) binding sites, repeat elements, and single nucleotide polymorphisms (SNPs) to predicting phylogenetic trees, genome scale models, secretomes, and horizontal gene transfers. Tese approaches can generally be sorted into two broad categories: Studies identifying orthologues/paralogues (see ‘Identifcation of orthologues across species’, below) and studies identifying synteny across genomes (see ‘Whole genome comparisons and synteny analysis’, below). Selecting the right method for a specifc analysis depends on multiple factors, the most important ones being the research topic or question and the type of data involved. Te data type and quality can have a large impact on the strength of the analysis and a variety of methods have been developed to deal with items such as variations in annotation and sequencing quality. Because of these large variations and biological questions, the right tools and methods should be used in the right context. Te following section describes diferent cases and the methods used, and presents an overview table of studies and types of analysis (Table 4.3).

Identification of orthologues across species As is evident from Table 4.3, there is a wide range of applications, methods, and parameters for identifying orthologues. A key component of all these methods is to compare either DNA or protein sequences. Sequence alignment is a method to compare the DNA, RNA or proteins to identify similar regions. Tere are a large number of alignment algorithms addressing scalability, speed, and parameter optimization, where some are mentioned in Table 4.3. Te predominant alignment algorithm for comparative genomics is the Basic Local Alignment Search Tool (BLAST), which combines database searches with a fast heuristic Smith–Waterman-based local alignment approach (Altschul et al., 1990). It is perhaps the most widely used tool, but does come with some pitfalls, which warrants a detailed discus- sion. A common orthologue selection approach is to use bidirectional (reciprocal) best hits between genomes. It is a very rigorous approach that almost certainly will fnd true ortho- logues (Wolf and Koonin, 2012), but will miss as much as 60% of potential orthologues (Dalquen and Dessimoz, 2013), in that only one hit is identifed for each gene or protein. Another, less stringent approach is to use reciprocal hits in combination with additional selection criteria, among other expectancy value (e-value), alignment identity (%) and align- ment coverage (%), which increases the likelihood of identifying orthologous relations and allows the identifcation of multiple orthologues for a given gene or protein. In general, one 28 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P x Database x x x x x x x x x x x x x x Local installation x x x x x x x x x x Online x x x x x x x x x x Visualization x x x x Toolbox x x x x x Wang et al., 2014 x x x x x x x Vongsangnak et al. 2008 x , 40%

200aa

> > x x x van den Berg et al., 2008 x –05 x x x x x x x Rokas et al., 2007 x 1e

> –10 x x x x x x Pi et al. 2015 x 50% , 50% 1e

> > > –10 aspergilli

x x x x x x Pel et al., 2007 x 1e

100aa

> > and

–05 x x x x x x x Marcet-Houben et al. 2012 x , 50% 1e

> > x x x x x 100 100

Machida et al., 2005 penicillia > > of

–30 x x x x J. Liu et al. 2013 x , 40% 1e

> > –10 –05 x x x x x x x studies G. Liu et al. 2013 50% , 60% 1e 1e

> > > > x x x x x x x Kis-Papo et al., 2014 x x x Inglis et al., 2013 x x x x genomics Gibbons and Rokas, 2009

x x x x x x Galagan et al. 2005 x 80%

> –30 –10 x x x x x Flipphi et al., 2009 x 1e 1e

40% > > comparative

–05 in x x x x x x x x x

Fedorova et al., 2008 1e

> –50 x x x David et al., 2008 x 1e

> applied

–30 x x x Coutinho et al., 2009 x 1e

> x x x Braaksma et al. 2010 x 70% 40% , 80%

> > > parameters

–05 x x x x x x Ballester et al. 2015 x , 50% 1e

and > >

x x x x x x x Arnaud et al., 2010 x –75 x x x Andersen et al., 2011 x 1e identifcation methods

x x x Andersen et al., 2008 x x of

feaure

–30 x x x Agren et al. 2013 x x 50% , 40% 1e

200aa

> > > >

clustering

Overview

(I)

VII) length length aa ,

of (V) comp. mutual (V

hit

model 25%

4.3 50 VI)

,

pipeline

(IX) study (VIII)

alignment ≥ (V hits hit

cutof nucleotide

score II) score (IV) of (

(III) Bidirectional E-value BLAT MUMmer PASA GATK AI BLASTn E–value MUSCLE SOAP Sybil SAMtools Bit Alignment Alignment coverage Alignment coverage shortest Alignment Bit BLASTx AL BWA identity best Modifed best BLASTp Sequence similarity tBLAST Homology/orthology BLAST Core Clade/group Specialized genomics Focus Strain Type Genome-scale metabolic Database Genome sequenced Alignment-based InParanoid E-value Table Genus Penicillium Aspergillus Match CAFE Strain/clade/core-specifc Synteny Single polymorphism TRIBE-MCL OrthoMCL 29 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P x Database x x x x x x x x x x x x x x Local installation x x x x x x x x x x Online x x x x x x x x x x Visualization x x x x Toolbox x x x x x Wang et al., 2014 x x x x x x x Vongsangnak et al. 2008 x , 40%

200aa

> > x x x van den Berg et al., 2008 x –05 x x x x x x x Rokas et al., 2007 x 1e

> –10 x x x x x x Pi et al. 2015 x 50% , 50% 1e

> > > –10 aspergilli x x x x x x Pel et al., 2007 x 1e

100aa

> > and

–05 x x x x x x x Marcet-Houben et al. 2012 x , 50% 1e

> > x x x x x 100 100

Machida et al., 2005 penicillia > > of

–30 x x x x J. Liu et al. 2013 x , 40% 1e

> > –10 –05 x x x x x x x studies G. Liu et al. 2013 50% , 60% 1e 1e

> > > > x x x x x x x Kis-Papo et al., 2014 x x x Inglis et al., 2013 x x x x genomics Gibbons and Rokas, 2009 x x x x x x Galagan et al. 2005 x 80%

> –30 –10 x x x x x Flipphi et al., 2009 x 1e 1e

40% > > comparative

–05 in x x x x x x x x x

Fedorova et al., 2008 1e

> –50 x x x David et al., 2008 x 1e

> applied

–30 x x x Coutinho et al., 2009 x 1e

> x x x Braaksma et al. 2010 x 70% 40% , 80%

> > > parameters

–05 x x x x x x Ballester et al. 2015 x , 50% 1e

and > > x x x x x x x Arnaud et al., 2010 x –75 x x x Andersen et al., 2011 x 1e identifcation methods

x x x Andersen et al., 2008 x x of

feaure

–30 x x x Agren et al. 2013 x x 50% , 40% 1e

200aa

> > > >

clustering

Overview

(I)

VII) length length aa ,

of (V) comp. mutual (V

hit

model 25%

4.3 50 VI)

,

pipeline

(IX) study (VIII)

≥ alignment ≥ (V hits hit

cutof nucleotide

score II) score (IV) of (

(III) Bidirectional E-value BLAT MUMmer PASA GATK AI BLASTn E–value MUSCLE SOAP Sybil SAMtools Bit Alignment Alignment coverage Alignment coverage shortest Alignment Bit BLASTx AL BWA identity best Modifed best BLASTp Sequence similarity tBLAST Homology/orthology BLAST Core Clade/group Specialized genomics Focus Strain Type Genome-scale metabolic Database Genome sequenced Alignment-based InParanoid E-value Table Genus Penicillium Aspergillus Match CAFE Strain/clade/core-specifc Synteny Single polymorphism TRIBE-MCL OrthoMCL 30 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P x Database x x x x x x x x x x x x x x x x x x x x x x x x x x x Local installation x x x x x x x x x x x x x x x x x x x x x x x Online x x x x x x x x x x x x Visualization x x x x x x x Toolbox x

Wang et al., 2014

Vongsangnak et al., 2008 x x x van den Berg et al. 2008 x x x x , 95%

>

Rokas et al., 2007 x –10 –10 x x x Pi et al. 2015 x 80% , 50% 50% 80% 1e 1e

> > > > > > x x x Pel et al., 2007 x –05 x x x x x x x x x Marcet-Houben et al. 2012 x , 50% 1e

> >

Machida et al., 2005 x

J. Liu et al., 2013 x –10 x x x x G. Liu et al. 2013 x , 60% 1e

> > x x x x Kis-Papo et al., 2014 x

Inglis et al., 2013 x x x Gibbons and Rokas, 2009 x

Galagan et al. 2005 x x x x Flipphi et al., 2009 x –04 x x x x x x x x x x Fedorova et al., 2008 x 1e

>

David et al., 2008

Coutinho et al., 2009 x

Braaksma et al., 2010 0.43

> –05 –05 x x x x x x x x Ballester et al. 2015 x , 50% 50% 50% 1e 1e

> > > > >

Arnaud et al., 2010

Andersen et al., 2011 x

Andersen et al., 2008 x - Agren et al., 2013 x

% (XIX) Continued

(XIII) of

or

(query)

aa (XVIII)

number cluster

aa (X) (XVII) genes Repeats

when (XV)

(XIV)

length (XVI) 4.3 (XII)

in

(XI) found

introns

DIALIGN-TX M-Cofee MUSCLE [DNA/aa] Kalign MAFFT ClustalX RaxML PhyML/BioNJ E-value Gblocks Cosmo SMURF Identical of Similar GeneDoc ClustalW TMHMM Multi-lagan RSAT E-value PSORT BLASTp Repeat-Masker RepeatScout Tandem Finder Transposon-PSI tBLASTn Emboss etandem Alignment identity Alignment coverage Alignment coverage Present genes was Secretome SignalIP Alignment identity Alignment coverage TRANSFAC BLASTp Evolutionary clustering Secondary metabolism Alignment Table Non-coding repeats Phylogeny Idenifcation conserved 31 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P x Database x x x x x x x x x x x x x x x x x x x x x x x x x x x Local installation x x x x x x x x x x x x x x x x x x x x x x x Online x x x x x x x x x x x x Visualization x x x x x x x Toolbox x

Wang et al., 2014

Vongsangnak et al., 2008 x x x van den Berg et al. 2008 x x x x , 95%

>

Rokas et al., 2007 x –10 –10 x x x Pi et al. 2015 x 80% , 50% 50% 80% 1e 1e

> > > > > > x x x Pel et al., 2007 x –05 x x x x x x x x x Marcet-Houben et al. 2012 x , 50% 1e

> >

Machida et al., 2005 x

J. Liu et al., 2013 x –10 x x x x G. Liu et al. 2013 x , 60% 1e

> > x x x x Kis-Papo et al., 2014 x

Inglis et al., 2013 x x x Gibbons and Rokas, 2009 x

Galagan et al. 2005 x x x x Flipphi et al., 2009 x –04 x x x x x x x x x x Fedorova et al., 2008 x 1e

>

David et al., 2008

Coutinho et al., 2009 x

Braaksma et al., 2010 0.43

> –05 –05 x x x x x x x x Ballester et al. 2015 x , 50% 50% 50% 1e 1e

> > > > >

Arnaud et al., 2010

Andersen et al., 2011 x

Andersen et al., 2008 x - Agren et al., 2013 x

% (XIX)

(XIII) of or

(query)

aa (XVIII)

number cluster

aa (X) (XVII) genes Repeats

when (XV)

(XIV)

length (XVI) (XII)

in

(XI) found

introns

DIALIGN-TX M-Cofee MUSCLE [DNA/aa] Kalign MAFFT ClustalX RaxML PhyML/BioNJ E-value Gblocks Cosmo SMURF Identical of Similar GeneDoc ClustalW TMHMM Multi-lagan RSAT E-value PSORT BLASTp Repeat-Masker RepeatScout Tandem Finder Transposon-PSI tBLASTn Emboss etandem Alignment identity Alignment coverage Alignment coverage Present genes was Secretome SignalIP Alignment identity Alignment coverage TRANSFAC BLASTp Evolutionary clustering Secondary metabolism Alignment Non-coding repeats Phylogeny Idenifcation conserved 32 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P

x x x x x x Database x

x x x x x x x x x x x

Local installation and

domain R

prediction

x x x x x x x x x x

http:// Online x* , Peer (Bembom

An Alignment

x x x x x x x x x x x x

Visualization De

MUMmer 2000);

Haas (XII)

, location - x x x x x x Toolbox (Van Cosmo

Tool;

2003); x x Wang et al., 2014 (Brian

, . transmembrane Sybil protein

al

2007);

(http://www.repeatmasker.

, . et (Castresana

x x x

Vongsangnak et al., 2008 al Burrows–Wheeler (XVII)

Analysis

(XXIII) 2008);

et

, . (VI)

al tool;

(Brudno

Gblocks et x

van den Berg et al., 2008 Transposon-PSI (Larkin

(Li Sequence

Repeat-Masker

mapping; 2010);

visualization;

, prediction 2000); . Rokas et al., 2007 , SOAP .

al and

Multi-lagan

tree

2014); al

et ,

ClustalW/X

et Regulatory

2010); x x x x x location (XXII) Pi et al. 2015 , 50% , 1994);

. 1000 (XI) , >

. al 2006);

detection al

,

. et

(McKenna et al

(Stamatakis

x x x x x (Wingender Pel et al., 2007 et

1000 likelihood; subcellular

sequence; variation GATK (Khaldi

Bie

(Kumar

RaxML

(XVI)

(De x x x x x x x Marcet-Houben et al., 2012 100 2009); protein

, SMURF maximum TRANSFAC MEGA a

2007);

alignment , CAFE

.

in

Machida et al., 2005 al

predictor; (XXI) for

2007); et 2007); 2006);

, , (V) , . .

2009); . (Rambaut

, al al al

repeats

x x J. Liu et al., 2013 et et et

for (Horton

sequence

Durbin FigTree

(X)

Evolution; construction;

x x x x x x G. Liu et al., 2013 x and 1000

signal (Wallace

PSORT tree

2010);

(Li

, . Family x x x x (XV) Kis-Papo et al., 2014

al Package; (Emanuelsson (Emanuelsson

2011); BWA et x x x

Inglis et al., 2013 , gene

M-Cofee

likelihood Finder; of

x x Gibbons and Rokas, 2009 SignalP Analysis

2002);

TMHMM

, 2002); , x x . (Criscuolo

Galagan et al. 2005 2003).

al 2009); 1000 Regions , Analysis (Kent

.

, 1999); maximum .

al et , (Huerta-Cepas

al

et

(XX) et PhyML x x x BLAT

Flipphi et al., 2009 (Li 1000

(Li

(Katoh Toolkit Oligonucleotide

Unknown

(Benson

2000); 2009); , ETE , x x x

Fedorova et al., 2008 . Computational Short

al

MAFFT Finder

OrthoMCL et SAMtools (IV) construction;

(IX)

(Retief 2000);

Metabolite

, x David et al., 2008 . tree 2005);

al

, 2002); 2008); Repeats length; ,

et Phylip ,

Toolkit; .

.

al al

(Camacho

Coutinho et al., 2009 x et et

1000

(Rice Secondary

2003); Tandem ,

Analysis

alignment BLAST

(XIV)

Sonnhammer

Braaksma et al., 2010 (Enright (III)

(Haas

Emboss

2012); neighbour-joining

, and .

al Genome

follows:

(XIX) x x x x x PASA

et Ballester et al., 2015 predictor; 100 2010);

as identity;

, (Thomas-Chollier

. (VIII)

al site

TRIBE-MCL

x Riley

Arnaud et al., 2010 2004); et (Lassmann

,

tool; RSAT

alignment;

x x x x 2007; Andersen et al., 2011 alignment referenced

binding 1000 ,

. Kalign (Edgar

TreeCon;

(II) 2005);

al

are ,

x Andersen et al., 2008 . et Alignment

al

factor

2005); et (Subramanian consensus

table

,

. - x x x x MUSCLE al

Agren et al., 2013

for

this reciprocal;

et

(Price (Crabtree

(I) in

Sequence

(XVIII)

acid.

in

1999a); Continued (XI)I

transcription

,

. 1994) DIALIGN-TX

(XXII) ML , (O’Brien al

related HGT

annotation

tool;

(XIX) request. (XIII) included Multiple found absent

related (XXII) et 4.3

amino

for (XIX)

(XX)

, (XXIII) toolkit 2007); per

RepeatScout annotation

(VII) aa , closely

.

Wachter

al Cluster sequence identity ETE Cluster distant species Cluster in species TreeCon FigTree Phylip MEGA Bootstrap

CAZy CELLO KEGG Visualization InterPro AntiSMASH BLASTp Pfam InParanoid (Delcher org/); De transposonpsi.sourceforge.net); Table Functional GO *Only tool; package; prediction tool. Methods et 33 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P

x x x x x x Database x

x x x x x x x x x x x

Local installation and

domain R

prediction

x x x x x x x x x x

http:// Online x* , Peer (Bembom

An Alignment

x x x x x x x x x x x x

Visualization De

MUMmer 2000);

Haas (XII)

, location - x x x x x x Toolbox (Van Cosmo

Tool;

2003); x x Wang et al., 2014 (Brian

, . transmembrane Sybil protein

al

2007);

(http://www.repeatmasker.

, . et (Castresana

x x x

Vongsangnak et al., 2008 al Burrows–Wheeler (XVII)

Analysis

(XXIII) 2008);

et

, . (VI)

al tool;

(Brudno

Gblocks et x van den Berg et al., 2008 Transposon-PSI (Larkin

(Li Sequence

Repeat-Masker

mapping; 2010);

visualization;

, prediction 2000); . Rokas et al., 2007 , SOAP .

al and

Multi-lagan

tree

2014); al

et ,

ClustalW/X

et Regulatory

2010); x x x x x location (XXII) Pi et al. 2015 , 50% , 1994);

. 1000 (XI) , >

. al 2006);

detection al

,

. et

(McKenna et al

(Stamatakis

x x x x x (Wingender Pel et al., 2007 et

1000 likelihood; subcellular

sequence; variation GATK (Khaldi

Bie

(Kumar

RaxML

(XVI)

(De x x x x x x x Marcet-Houben et al., 2012 100 2009); protein

, SMURF maximum TRANSFAC MEGA a

2007);

alignment , CAFE

.

in

Machida et al., 2005 al

predictor; (XXI) for

2007); et 2007); 2006);

, , (V) , . .

2009); . (Rambaut

, al al al

repeats

x x J. Liu et al., 2013 et et et

for (Horton

sequence

Durbin FigTree

(X)

Evolution; construction;

x x x x x x G. Liu et al., 2013 x and 1000

signal (Wallace

PSORT tree

2010);

(Li

, . Family x x x x (XV) Kis-Papo et al., 2014

al Package; (Emanuelsson (Emanuelsson

2011); BWA et x x x

Inglis et al., 2013 , gene

M-Cofee

likelihood Finder; of

x x Gibbons and Rokas, 2009 SignalP Analysis

2002);

TMHMM

, 2002); , x x . (Criscuolo

Galagan et al. 2005 2003).

al 2009); 1000 Regions , Analysis (Kent

.

, 1999); maximum .

al et , (Huerta-Cepas

al

et

(XX) et PhyML x x x BLAT

Flipphi et al., 2009 (Li 1000

(Li

(Katoh Toolkit Oligonucleotide

Unknown

(Benson

2000); 2009); , ETE , x x x

Fedorova et al., 2008 . Computational Short

al

MAFFT Finder

OrthoMCL et SAMtools (IV) construction;

(IX)

(Retief 2000);

Metabolite

, x David et al., 2008 . tree 2005);

al

, 2002); 2008); Repeats length; ,

et Phylip ,

Toolkit; .

.

al al

(Camacho

Coutinho et al., 2009 x et et

1000

(Rice Secondary

2003); Tandem ,

Analysis

alignment BLAST

(XIV)

Sonnhammer

Braaksma et al., 2010 (Enright (III)

(Haas

Emboss

2012); neighbour-joining

, and .

al Genome

follows:

(XIX) x x x x x PASA

et Ballester et al., 2015 predictor; 100 2010);

as identity;

, (Thomas-Chollier

. (VIII)

al site

TRIBE-MCL

x Riley

Arnaud et al., 2010 2004); et (Lassmann

,

tool; RSAT

alignment;

x x x x 2007; Andersen et al., 2011 alignment referenced

binding 1000 ,

. Kalign (Edgar

TreeCon;

(II) 2005);

al

are ,

x Andersen et al., 2008 . et Alignment

al

factor

2005); et (Subramanian consensus

table

,

. - x x x x MUSCLE al

Agren et al., 2013

for

this reciprocal;

et

(Price (Crabtree

(I) in

Sequence

(XVIII)

acid. in

1999a); (XI)I

transcription

, . 1994) DIALIGN-TX

(XXII) ML , (O’Brien al related HGT

annotation

tool;

(XIX) request. (XIII) included Multiple found absent

related (XXII) et

amino

for (XIX) (XX)

, (XXIII) toolkit 2007); per

RepeatScout annotation

(VII) aa , closely

.

Wachter

al Cluster sequence identity ETE Cluster distant species Cluster in species TreeCon FigTree Phylip MEGA Bootstrap

CAZy CELLO KEGG Visualization InterPro AntiSMASH BLASTp Pfam InParanoid (Delcher org/); De transposonpsi.sourceforge.net); Functional GO *Only tool; package; prediction tool. Methods et 34 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 56 | Nybo et al.

should be cautious with using e-values as sole selection criterion, especially when compar- ing across studies, as it is directly derived from database size and is biased towards long sequence matches and sensitive to data bias. Tis is particularly challenging in fungi, as there are very large proteins (e.g. polyketide synthases), which may have partial matches to each other, which are longer than some full-length proteins. For this reason, the main motivation for e-value cutofs should be to reduce the number of false/random hits, and not to select true orthologues. Tis is best achieved by bidirectional hits in tandem with alignment cover- age and identity. Tis also comes with a note of caution. In particular alignment identity is dependent on phylogenetic distance of the species investigated. For this reason, it is not pos- sible to recommend a single set of parameters for BLAST for orthologue detection. Instead we suggest examining the data set and species frst, and choose parameters based on the values generated by known orthologues. Typically used parameter setings are described in Table 4.3.

Orthology case studies A primary application of the identifcation of putative orthologues is the transfer of anno- tation information between genomes. In a recent example, the genome of A. ustus was sequenced by Pi et al. (2015) and the potential protein-coding sequences were found using AUGUSTUS sofware. Tese sequences were primarily annotated by homology search against the NCBI non-redundant (nr) protein database (www.ncbi.nlm.nih.gov) using BLASTp with the selection criteria (e-value < 1e–03, alignment identity > 25%, query align- ment coverage > 50%). Teir gene ontology annotations were transferred to A. ustus when bidirectional best hits between A. ustus and A. nidulans were fting the strict parameters (e-value < 1e–10, alignment identity > 50%, query alignment coverage > 50%) (Pi et al., 2015). Orthologues have also been used in evolutionary studies. Phylogeny has ofen been predicted from a single or few conserved genes, but depending on the taxonomic span, the resolution might need to include numerous genes. An approach could be to fnd all the orthologues shared among the compared species (the core genome) and use these to generate the phylogenetic relation between the species. Investigating shared genes can also give insight into evolutionary development such as horizontal gene transfer and selective pressure. In a study of pathogenic flamentous fungi, the evolutionary relations between seven Aspergillus strains were generated by pairwise comparing genomes using bidirectional best BLASTp hits with an e-value < 1e–05. From the core genome, 90 genes were selected on the basis of similar lengths and identical number of intron/exon structures. From this analysis Fedorova et al. (2008) additionally found genes that potentially were horizontally trans- ferred between Chaetomium globosum and A. fscherianus (Neosartorya fscheri).

Whole-genome comparisons and synteny analysis Whole genome-to-genome comparison is a highly desired analysis, which includes the study of genetic variations in genes and non-coding regions, chromosomal rearrangements and genetic co-localization (synteny). Although, the algorithms used today are not ideal as the number of genomes increases (comparison of a large number of genomes is in essence a multidimensional problem) studies with 3–5 whole genome comparisons have not been uncommon (e.g. Fedorova 35 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 57 et al., 2008; Galagan et al., 2005; Rokas et al., 2007). Te main issue with increasing the number of genomes is computational time, which increases exponentially with the number of genomes in the comparison, and poses challenges with visualizing the output. Should new algorithms solve this issue, it will allow us to efciently study more complex dynamics of fungal genomes. Synteny is commonly addressed using sequence alignments, comparing the localization of genes in one genome with the positions of similar genes in another genome. Te align- ments are ofen performed using BLASTp, but other more specialized algorithms as SYBIL (Crabtree et al., 2007; Riley et al., 2012) and MUMmer (Delcher et al., 1999a) have been used (Table 4.3). MUSCLE (Edgar, 2004), MAFFT (Katoh et al., 2002), CLUSTAL X/W (Larkin et al., 2007), and Kalign (Lassmann and Sonnhammer, 2005) are all multiple sequence alignment tools optimized for either long DNA or protein sequences which have been applied in fungal phylogeny research. Additionally, M-Coffee has an extra unique feature where it combines all alignments from other tools generating one consensus align- ment (Wallace et al., 2006).

Case examples

Synteny case studies When the genome of P. chrysogenum was released, van den Berg et al. (2008) aligned P. chry- sogenum to A. nidulans, A. niger, A. fumigatus, and A. oryzae showing that reshufing of the chromosomes has occurred afer divergence of the Aspergillus and Penicillium lineages. In particular, they pairwise aligned the Aspergillus contigs larger than 100 kb to the 14 largest supercontigs of P. chrysogenum using the Promer program of the MUMmer package (van den Berg et al., 2008). Tese diferent cases provide insight to what comparative genomics can be used for. Even if the methods are based on the tools used for single genome analysis, comparing genomes with each other provides the means to study species evolution, gene and chromosomal diversity, physiological traits and novel as well as horizontally transferred genes. Further- more, comparative genomics provides the opportunity of combining evidence of single genome studies, adding strength to knowledge deduced from the analysis.

Primary metabolism Primary metabolism can be defned as the set of metabolic reactions directly supporting the propagation of a species by providing precursors for growth and reproduction. Primary metabolism of flamentous fungi is of great interest due to the intriguing phenotype of over- producing large quantities of di- or tri-carboxylic acids derived from intermediates of the citric acid cycle in diferent species of Aspergillus, as well as providing favour compounds for food production and precursor molecules for production of bioactive compounds from Penicillium species. Examples of industrially interesting processes include the production of citric acid by A. niger (Karafa and Kubicek, 2003) or production of itaconic acid by A. terreus (Okabe et al., 2009). In order to understand these inter- and intraspecifc diferences, comparative genomics has been applied as a tool for examining the genomic basis of difer- ent phenotypes, e.g. the citric acid-producing strain ATCC 1015 and protein overproducing strain CBS 513.88 of A. niger (Andersen et al., 2011). Diferences in copy number and 36 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 58 | Nybo et al.

predicted functions can be correlated to productivity and therefore represent an invaluable tool for genetic engineering and target identifcation. Primary metabolism has been exam- ined on an individual pathway basis in the past, identifying and characterizing the function of individual genes involved in the diferent pathways.

General comparisons of primary metabolism Flipphi et al. (2009) used the genomes of eight Aspergillus species for the comparison of the primary metabolism genes in these organisms. Using this dataset, the authors could identify gene duplications at several steps of primary metabolism indicating the potential of these species to adapt to their individual niches. By comparing multiple genomes the evolution of individual genes can be reconstructed and gene duplication events narrowed down in time.

Genome-scale metabolic models as sources for comparative genomics Another way of using the wealth of information provided by fungal sequencing eforts is the construction of genome-scale metabolic models. Similar to genome sequencing, where the complete list of genes of a species is enumerated, genome-scale models aim at assembling the complete set of metabolic reactions for a selected species. Due to the unsolved challenge of characterizing metabolic enzymes in a high-throughput manner, understanding diferences in primary metabolic phenotype between individual strains or species, relies on external information. In the classical bioinformatics approach, domain and functional predictions are based on sequence motifs and structural properties of the gene products. In addition to this approach, comparative genomics also aims at infer- ring information of uncharacterized genes from similar characterized genes in a close neighbouring species. Comparative genomics is therefore frequently used during the construction of genome-scale models or in order to fll gaps in known pathways as well as for the construction of less well-characterized organism. Tere are multiple genome-scale metabolic reconstructions of flamentous fungi available (Table 4.3) and their usage has been reviewed recently (Brandl and Andersen, 2015). Many of these models have been reconstructed using previously published models as templates or source of metabolic reactions (Fig. 4.1). Here, the relatively well-annotated fungal genome of S. cerevisiae has served as an initial source of information to generate the frst genome-scale models of Aspergillus species, which has then propagated to other Aspergilli, and a recent model for P. chrysogenum. Tis use of comparative genomics propagates biological evidence across species thereby increasing the level of experimental support of the individual models. However, atention should be paid to the possibility of propagating errors between diferent reconstructions as well as to a potential ‘dilution’ of the experimental backing. Tis dilution arises from the fact that gene–function associations are inferred from similarities to a gene in another organism. A cutof has to be chosen for these comparisons in order to classify gene comparisons as meaningful. By chaining these comparisons the experimental evidence is diluted, as new sequences are no longer compared to the original source but to an inferred gene function itself. Comparative genomics is not the only source for model generation, as manual reconstruc- tions rely on experimental evidence and bibliomic data, but it is heavily used by algorithms for the automatic generation of genome-scale models (e.g. as described by Pitkänen et al., 37 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 59

Figure 4.1 Information transfer between diferent fungal genome-scale models. Propagation of information between the genome-scale models of S. cerevisiae (Förster et al., 2003), A. niger (Andersen et al., 2008a; David et al., 2003), A. nidulans (David et al., 2008), A. oryzae (Vongsangnak et al., 2008), A. terreus (Liu et al., 2013b) and P. chrysogenum (Agren et al., 2013).

2014). Te combined network from multiple organisms is a strong tool for rapid tentative annotation of primary metabolism in new sequenced organisms.

Secondary metabolism Comparative genomics is a powerful tool to analyse secondary metabolite producing gene clusters across fungal genomes. In several cases horizontal gene transfer of polyketide synthases (PKS) from bacteria to flamentous fungi has been observed, as well as transfer of gene clusters between fungal species, enabling the fungus to create new compounds by implementation of other genes in the cluster or exchanging cluster members, e.g. in the case of afatoxin and sterigmatocystin (Osbourn, 2010). Comparison of gene sequences also led to the identifcation of new gene clusters, e.g. the afatrem genes in A. favus have been characterized due to their homology to paxilline from P. paxilli (Nicholson et al., 2009), which is an elegant example of how slight modifcations in sequence can yield another compound. Tere are also examples of plant genes, which have been characterized based on fungal sequence information. Te cluster coding for helvolic acid synthesis was frst identifed in A. fumigatus, and then cloned into S. cerevisiae verifying helvolic acid production. Te sequence of the helvolic acid gene cluster was used to deter- mine the biosynthetic genes in plants, shedding light on biosynthetic pathways (Mitsuguchi et al., 2009). Several examples of similar types have been reviewed recently (Sanchez et al., 2012). Using comparative genomics, one can either search manually (e.g. using sequence alignments) for conserved backbone proteins or their domains in combination with their accessory tailoring genes. Alternatively, one can use predictor tools at the genome level in combination with sequence databases to compare gene clusters throughout organisms. 38 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 60 | Nybo et al.

Table 4.4 presents an overview of such databases and predictors, which are relevant for fungi. A more comprehensive overview also including bacterial tools can be found in Weber et al. (2014). In general, Table 4.4 demonstrates that there are multiple algorithms to choose from, if one wants to comparatively investigate secondary metabolism in fungi. In particular, the predictors can be divided into identifying gene clusters and predicting the specifcity of synthases and synthetases in the gene clusters.

Prediction of secondary metabolite gene clusters In general, there are two main methods applied within the feld of predicting second- ary metabolite clusters: the Antibiotics and Secondary Metabolite Analysis SHell (AntiSMASH) and the Secondary Metabolite Unique Regions Finder (SMURF) (Khaldi et al., 2010; Weber et al., 2015). AntiSMASH has recently been upgraded to version 3.0 and optimized to perform faster on large scale projects. Te algorithm was built using a profle Hidden Markov Model

Table 4.4 Overview of databases of secondary metabolites, their genes and predictors for fungal secondary metabolite gene clusters Name URL Reference Organism Focus1

Databases Clustermine 360 http://www.clustermine360.ca/ Conway and Fungi, Clusters Boddy, 2013 bacteria Norine http://bioinfo.lif.fr/norine/ Caboche et al., Fungi, Compound 2008 bacteria Novel Antibiotics http://www0.nih.go.jp/~jun/ Fungi, Compound Database NADB/search.html bacteria PubChem https://pubchem.ncbi.nlm.nih. Fungi, Compound gov/ OTHER

Prediction of secondary metabolite gene clusters antiSMASH3 http://antismash. Weber et al., Fungi, Clusters secondarymetabolites.org/ 2015 bacteria SMURF http://jcvi.org/smurf/index.php Khaldi et al., Fungi, Clusters 2010 bacteria

Prediction of substrate specifcity antiSMASH3 http://antismash. Weber et al., Fungi, PKS, secondarymetabolites.org/ 2015 bacteria NRPS MAPSI/ASMPKS http://gate.smallsoft.co.kr:8008/ Park, 2009 Fungi, PKS pks/ bacteria NRPSpredictor2 http://nrps.informatik. Röttig et al., Fungi, NRPS uni-tuebingen.de/ 2011 bacteria Controller?cmd=SubmitJob LSI-based http://bioserv7.bioinfo. Baranašić et al., Bacteria, NRPS A-domain function pbf.hr/LSIpredictor/ 2014 Fungi predictor AdomainPrediction.jsp

1PKS, ; NRPS, non-ribosomal peptide synthetase. 39 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 61

(pHMM) on mostly bacterial but also fungal clusters from the NCBI protein database. Once a cluster protein domain (including PKS, synthases (NRPS), terpenes, aminoglycosides, aminocoumarins, indolocarbazoles, lantibiotics, bacteriocins, nucleosides, siderophores or melanin) is found, the program determines accessory genes, by scanning the genome in a 10 kb range and then searches outward by 5, 10 or 20 kb depend- ing on the cluster type. Tis is highly efcient, but an observed problem with this method is that clusters are predicted to have too many genes included (as commented by Inglis et al., 2013). It rarely underpredicts. A new feature of version 3 is the implementation of Cluster- Finder (Cimermancic et al., 2014), which uses a HMM algorithm based on 732 bacterial gene clusters to determine gene clusters of unknown type, flling a gap in the AntiSMASH functionality. However, the training set only consisted of bacterial gene clusters and the performance of this algorithm on fungal sequences still has to be evaluated. SMURF was launched prior to AntiSMASH, and works in a similar fashion, but only fungal conserved domains were explored to build a HMM algorithm, thus giving it a more specifc focus. It also examines a narrower set of cluster domains compared to AntiSMASH, namely dimethyl-allyltransferases including prenyltransferases, hybrids (combinations of PKS and NRPS domains), PKS, PKS-like, NRPS, NRPS-like, terpenecyclases. Once all backbones are found, the algorithm searches for members of 27 secondary metabolite- defning domains within a frame of 20 genes. Te user can set intergenic distance between genes in the cluster and the maximum number of secondary metabolite domain negative genes. In general, AntiSMASH has become sort of a de facto standard for secondary metabolite gene cluster predictions, and has developed a user-friendly web interface. Tis should be applied if one is interested in comparing gene clusters in a given genome. However, given that SMURF is trained on a fungal-specifc set, and has a runtime that is signifcantly below AntiSMASH, we recommend that both algorithms are applied and compared if possible.

Prediction of substrate specificity With gene clusters identifed, one can use a variety of tools to predict the chemicals gen- erated by such clusters (Table 4.4). AntiSMASH (see section on ‘Prediction of secondary metabolite gene clusters’) includes functions to predict the specifcity of both PKSs and NRPSs and is possibly the go-to method at the moment. As an alternative for NRPSs, NRP- Spredictor2 (Rötig et al., 2011) uses support vector machines to predict NRPS adenylation domain specifcity. Te majority of the training set consisted of bacterial gene clusters, while only 16% were of fungal origin. Fortunately, a predictor for fungal gene clusters was added to make investigation of fungal sequences possible.

Supplementing methods for cluster identification A few methods have been published, which use alternative methods for identifying gene clusters. In particular, (Inglis et al., 2013) used comparative genomics between A. nidulans, A. fumigatus, A. niger, A. oryzae to refne the borders of gene clusters returned by antiSMASH and SMURF. Andersen et al. (2013) designed an algorithm for the use of transcriptomics data in tandem with genome sequences to determine gene cluster boundaries. A beneft of this method is that it can identify superclusters, i.e. clusters on diferent chromosomal loci, such as austinol, dehydroaustinol and prenyl xanthones. Additionally, it is not limited by the annotation of the genes, which causes some gene members to be lef out in other 40 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 62 | Nybo et al.

algorithms. Te MIDDAS-M algorithm (Umemura et al., 2013) ofers the implementation of a transcriptomics-driven method using the same principles as the one of (Andersen et al., 2013).

Examples of comparative genomics in secondary metabolism Until now, comparative studies have shown that only few clusters are conserved in flamen- tous fungi. Khaldi et al. (2010) showed that only fve secondary metabolite gene clusters involved in protection against oxidative stress are conserved between the closely related spe- cies A. fumigatus Af293, A. clavatus and A. fscherianus (Neosartorya fscheri), implying that the other clusters are species-specifc. Tis study showed the power of comparative genom- ics to reveal common traits of species. However, in this analysis combined with the work by (Lind et al., 2015), one should note that the relatively large diversity in the compared species may underestimate the number of shared clusters. In the future, including new genomes to the analysis could redefne the ‘core’ clusters and reveal additional common traits. Chiang et al. (2010) employed comparative genomics and protein domain predictions to compare multiple types of PKSs to infer shared traits and modular functions of the proteins, thus identifying shared traits across species. Te study by Inglis et al. (2013), mentioned above, has been a frst approach to process gene clusters in fungi at larger scale. Tey identifed a total of 261 non-redundant clusters in multiple species using their predicted gene clusters refned by alignment. Tis is an excellent example of how comparative genomics can complement the quality of automated predic- tion and can accelerate the research on secondary metabolites in the future.

Secretome analysis and plant polysaccharide degradation Given the role of flamentous fungi as prolifc producers of extracellular enzymes, it is not surprising that a number of studies have employed comparative genomics to investigate the degradation potential of diferent species. Te applications span both the proteases and polysaccharide degrading enzymes. A general investigation has been performed by (Braaksma et al., 2010). Here, the authors atempted to study the secretome of A. niger, and did this by comparing the genome sequences of two A. niger strains to A. oryzae, A. fumigatus and A. nidulans to identify all theoretically secreted proteins. Tis has been combined with a proteomics analysis and allowed a refnement of the identifcation of secreted proteins from A. niger. Examining proteases has been in a study spanning seven Aspergillus species (Ozturkoglu Budak et al., 2014). Te authors performed an extensive study of proteases using compara- tive genomics, comparative proteomics, and protease assays. Te study confrmed that the aspergilli examined have protease capacities of similar magnitude, and that the diferent spe- cies have divergent profles of proteases dependent on the growth medium, thus suggesting specialized life styles. Polysaccharide degradation potential is a trait of interest in most species, and has been uncovered in many of the recent genome publications (Ballester et al., 2015; Marcet- Houben et al., 2012; Pel et al., 2007). For a more specialized investigation, Coutinho et al. (2009) examined the polysaccharide degradation potential of three Aspergillus species in detail, using a combination of CAZy predictions (Cantarel et al., 2009; Lombard et al., 2014), comparative genomics, and growth assays on polysaccharides. Furthermore, the 41 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 63 authors included a comparison to Podospora anserina. Te main discovery in this study was the fnding that the fungi had relatively similar degradation potential, but the actual degrada- tion profles of the fungi were diferent, possibly due to diferential regulation and natural ecological niches.

Other comparative ‘omics’ In the context of comparative genetics, it can be quite informative to also examine other types of comparative omics and examine what can be learned from these approaches. In this case, we see comparative omics as performing the same type of ‘omics’ analysis in multiple species and comparing the results using an identifcation of orthologues between the spe- cies. As the genome is used for orthologue determination (see ‘Identifcation of orthologues across species’, above), it can be seen as an additional dimension added to the comparative genomics data by an experimental analysis. Some studies of this type have been performed, in particular comparative transcriptomics and proteomics.

Comparative transcriptomics Comparative transcriptomics is particularly interesting, when one wishes to examine and identify conserved transcriptional regulation across species. To our knowledge the frst example of this in Aspergillus and/or Penicillium was by Andersen et al. (2008), where the regulatory response to growing on l-xylose relative to d-glucose was examined. Te authors carried out DNA microarray experiments for A. niger, A. nidulans and A. oryzae, identifed a set of genes with the same transcriptional response in all three examined species, and a common regulatory motif for binding of the xylanolytic transcriptional regulator (XlnR). Salazar et al. (2009) performed a similar study using glucose and glycerol as the carbon sources, which gave interesting fndings on the fungal response to osmotic stress. It was seen that the Aspergillus response to glycerol-induced osmotic stress has conserved elements and regulatory motifs to similar systems in S. cerevisiae and humans. Te study by Coutinho et al. (2009), which we discussed in the section on ‘Secretome analysis and plant polysaccharide degradation’, also employs comparative transcriptomics based on the data set of Andersen et al. (2008b) to study the secretory response and regula- tion, concluding a diferential regulation of secreted polysaccharide-degrading enzymes. Te responses to hypoxia in A. oryzae and A. nidulans have been studied by Terabayashi et al. (2012) by comparing DNA microarray experiments of A. oryzae with available data for A. nidulans. Cultures were performed under hypoxic conditions and identifed both conserved and species-specifc responses to hypoxia.

Comparative proteomics In this context, it is important to defne that we here see ‘comparative proteomics’ as the comparison of transcriptomic results from multiple species or diverse strains, and not, as is also seen in the literature, the comparison of proteomics from multiple conditions [e.g. Stoll et al. (2014) or other notable papers as also reviewed by Kniemeyer (2011)]. Within our defnition of comparative proteomics, an interesting aspect is, that this approach allows identifcation of conserved responses as well as unique traits across mul- tiple species. 42 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 64 | Nybo et al.

One such example in the area of immunogenicity, is the work of Benndorf et al. (2008), who studied allergens from spores of A. versicolor, A. fumigatus, and P. expansum by per- forming 2D gels in combination with immunobloting to identify allergenic proteins in individual and multiple species. A very comprehensive study of secreted proteases in seven diferent Aspergillus species has been conducted (Ozturkoglu Budak et al., 2014). Here, the authors mine the genomes for putative proteases, employ comparative genomics to identify shared and unique pro- teases, and followed up by conducting proteome profling on the secreted proteins from all seven species. Results were further validated with enzymatic protease assays. Te analysis identifed highly conserved proteins across the species, but with very diferent expression profles even on the same media. An important hypothesis was that the species may have diferent physiologies at the same time point although cultivated in the same manner; similar to what was concluded for the polysaccharide-degrading enzymes by Coutinho et al. (2009). In general, both comparative transcriptomics and proteomics, while expensive to conduct, have the advantage of leting us extrapolate what is known in one species across multiple species, thus convincingly allowing the researchers to identify whether a specifc trait is unique for a species or a general biological phenomenon. Such studies have interest- ing potentials for understanding biology at a more general level.

Future perspectives – what can we learn from bacteria? If we are to guess at what the future of fungal comparative genetics will bring, it is tempting to look at the kingdom of bacteria. Genome sequencing started in bacteria 10 years before Aspergilli and Penicillia with the frst sequencing of the prokaryote Haemophilus infuenzae in 1995. As a consequence of this head start, and the advantage of the lower price of sequenc- ing the smaller bacterial genomes, the number of bacterial genomes now number in the tens of thousands (Land et al., 2014), and some individual species have more than 2000 isolates sequenced (Cook and Ussery, 2013). Furthermore, by the time the frst fungal genome was published in 2005, sequence databases, online analysis tools, gene calling pipelines and functional annotation models had been created and optimized for bacteria and were applied to fungal genomes. In some aspects, the technology of analysing bacterial genomes is both the future and the past of fungal comparative genomics.

Data sharing, maintenance and standards An area, where the experience from bacterial genomics has been a great advantage, is in the construction and use of sequence databases. Trough resources like GenBank, SWISS- PROT and RefSeq, experiences and ideas have been gained and used to create systems with more optimized downloads, searches, data submission and community involvement. Optimal design of data sharing resources has continually led to the introduction of new setups in resources like GenBank (Wheeler et al., 2013). Te fungal community has the opportunity to take advantage of these experiences and create databases that suit the needs of the research community. Some of this work has already been addressed in the structure and integration of FungiDB (Stajich et al., 2012) and EuPathDB (Aurrecoechea et al., 2013) as discussed above (section on ‘Available genome databases with comparative genomics capabilities’). 43 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 65

A key component in seting up data resources is data standards, for downloads. Although diferent data formats are widely used (FASTA, TAB, XML) the specifcations on what information to include in these formats are minimal. Problems include naming of organ- isms, identifcation numbers of organisms and genes, relations between gene and protein sequences as well as metadata like literature references and growth conditions. All of these problems make it difcult to relate data resources to one another and utilize the full poten- tial of these eforts. Comparative genomics relies heavily on the availability and quality of data without which comparisons become biased and less informative (Schnoes et al., 2013). Te genetic data from sequencing of fungal genomes will increase rapidly in the coming years and eforts should be made to establish community standards for data, metadata, data submission and download to ensure the future use and reproducibility of in silico analyses. Eforts should also be made to encourage incorporation of experimental knowledge with sequence data to gain maximum information and understanding from years of experimental knowledge and future sequencing eforts. Te increased number of data and the pace at which it is being generated, presents a tech- nical problem for the feld of bioinformatics. In recent years the rate at which sequencing data has grown exceeds the pace of Moore’s law (a doubling every 2 years). Tis puts a new pressure on the development of methods that can deal with large amounts of data without relying on larger computers. Te issue of storage must also be addressed, with funding being granted to maintain online resources and publishing houses insisting on data submissions for research that does in silico analyses.

Comparative genomics-driven insights Comparative genomics has been used in a wide range of bacterial research areas and some applications have yet to be widely studied in fungal research. One such topic is the genetic variation of species, genus and family. Some atention has been given to the core genome concept, identifying genes found in all strains of a specifc set (Fedorova et al., 2008; Gala- gan et al., 2005). Genes found in specifc strain subsets can also be used to connect possible phenotypes to genotypes, aiding experimental design (Cheeseman et al., 2014; Coutinho et al., 2009). However, these studies have been limited by the number of available genomes and general genetic variance of fungi is yet to be investigated. As more data becomes avail- able along with well-documented metadata, it will be possible to use comparative genomics to identify genes involved in specifc traits such as pathogenicity or identify sets of strain- specifc genes. Comparative genomics can also aid in construction of evolutionary models by including whole genome paterns, adding deeper resolution to the known taxonomy (Rokas et al., 2007; Snipen and Ussery, 2010), especially in the light of the expansion in number of fungal genome sequences discussed in the section on ‘Current status of genomics’. Te role of fungi in biotechnology has long made them target for optimization of produc- tion. One comparative genomics-driven method of optimizing this is modelling of primary metabolism, another feature which has primarily been successful in bacteria and yeast (Heavner et al., 2013; Monk et al., 2013; Orth et al., 2011). Genome-scale metabolic models will facilitate metabolic engineering by ofering in silico predicted targets before experimen- tal testing (Brandl and Andersen, 2015). Comparative genomics can greatly increase the evidence for these targets by looking for paterns across multiple genome sequences. Te general concept of a model, a mathematical description of a patern in data, can in theory be used to describe almost any kind of patern assuming that enough example data is 44 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 66 | Nybo et al.

provided. Creating models requires training data that represents enough variation to gener- alized the patern (Ansari et al., 2004). Sequence models can for example be used to predict secreted proteins (Braaksma et al., 2010) or identify carbohydrate-active enzymes (CAZy) (Cantarel et al., 2009).

Genome sequencing and annotation A fungal research area that has extensively used comparative genomics is the feld of sec- ondary metabolism. High-throughput in silico screening of fungal genome sequences has revealed a large number of possible gene clusters involved in secondary metabolism, open- ing up for targeted experimental studies and discovery of new compounds for industrial production. Te topics addressed including identifcation of novel clusters (Chiang et al., 2009; Hansen et al., 2011; Inglis et al., 2013), locating known clusters in new genomes, assigning functions to genes in clusters (Szewczyk et al., 2008) and locating silent gene clusters (Bergmann et al., 2007). Te use of in silico methods within this feld has developed rapidly, including machine learning approaches as well as manually curated data collections (Weber, 2014). Secondary metabolism has also been studied extensively in bacteria, exploiting the relative ease of sequencing many of these species for genome mining (Duncan et al., 2015). However, despite this wealth of data and studies, there are still problems in the feld, in particular with gene fnding. Problems include inconsistencies due to diferent sequencing technologies or quality as well as use of diferent gene callers. Some of the most common problems involve over-annotation – prediction of genes that are not actually expressed. Tere is currently no clear way of determining in silico if a gene is truly a gene, experi- ments can reveal if a gene is expressed under some conditions but testing that a gene is not expressed under any conditions is not feasible on any scale. Comparative genomics has been used to address some of the annotation problems including assembly of genomes based on other genomes and gap flling (Delcher et al., 1999a) and annotation, gene fnding and func- tions (Inglis et al., 2013). Te scale of over- or under-annotation becomes increasingly obvious as more genomes are sequenced. In one study, 4.94 million genes were identifed in silico in 1474 prokaryotic organisms and 7% of these were not found in the published annotations (Wood et al., 2012). Te study showed that 60% of these genes were genes of length between 110 and 300 amino acids, thus suggesting that short genes in general are underpredicted. Another aspect of gene fnding is the seting of parameters used for diferent prediction programs. Although some setings are widely used, actual benchmarking of diferent setings and the efect of these should be performed and published for the general community as these can have a major efect on the analysis outcome (Mancheron et al., 2011). All of these topics ofer types of information and data ideal for modelling and machine learning. Data driven models have been extensively used in bacterial research to describe and identify functional protein domains such as PFAM or PANTHER (Finn et al., 2014; Mi et al., 2005) or creating gene calling models such as GLIMMER, SnowyOwl or Prodigal (Delcher et al., 1999b; Hyat et al., 2010; Reid et al., 2014). Te success of these methods relies heavily on available training data, connections to experimental data and manual cura- tion. Although these tools can be used for fungal genomes and gene sequences, the training data is currently biased towards bacteria due to the larger number of available data (e.g. Ant- iSMASH). In order for fungal annotation to improve, more fungal data should be included 45 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 67 in the model training as well as manual curation of literature and sequence models. Creating new tools specifcally for fungi is also an option, as in the SnowyOwl gene calling algorithm (Reid et al., 2014). We strongly believe that the post-genome future of fungal genomes will be highly driven by the application of comparative genomics approaches. Te future of fungal comparative genomics promises to produce more and beter data, addressing broader research felds and integrating into more projects. Establishing good practices on how this integration is created, how it should be published, documented and shared will be vital for its success and contribution to research. Making use of the experiences from other felds should shape the way the fungal community deals with these issues and afect the way grants are funded, publications are writen and projects are planned.

References Agren, R., Liu, L., Shoaie, S., Vongsangnak, W., Nookaew, I., and Nielsen, J. (2013). Te RVEN Toolbox and its use for generating a genome-scale metabolic model for Penicillium chrysogenum. PLoS Comput. Biol. 9. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410. Andersen, M., Nielsen, M., and Nielsen, J. (2008a). Metabolic model integration of the bibliome, genome, metabolome and reactome of Aspergillus niger. Mol. Syst. Biol. 4, 178. Andersen, M.R., Vongsangnak, V., Panagiotou, G., Salazar, M.P., Lehmann, L., and Nielsen, J. (2008b). A trispecies Aspergillus microarray: comparative transcriptomics of three Aspergillus species. Proc. Natl. Acad. Sci. U.S.A. 105, 4387–4392. Andersen, M.R., Salazar, M.P., Schaap, P.J., van de Vondervoort, P.J.I., Culley, D., Tykaer, J., Frisvad, J.C., Nielsen, K.F., Albang, R., Albermann, K., et al. (2011). Comparative genomics of citric-acid-producing Aspergillus niger ATCC 1015 versus enzyme-producing CBS 513.88. Genome Res. 21, 885–897. Andersen, M.R., Nielsen, J.B., Klitgaard, A., Petersen, L.M., Zachariasen, M., Hansen, T.J., Blicher, L.H., Gotfredsen, C.H., Larsen, T.O., Nielsen, K.F., et al. (2013). Accurate prediction of secondary metabolite gene clusters in flamentous fungi. Proc. Natl. Acad. Sci. U.S.A. 110, E99–E107. Ansari, M.Z., Yadav, G., Gokhale, R.S., and Mohanty, D. (2004). NRPS-PKS: A knowledge-based resource for analysis of NRPS-PKS megasynthases. Nucleic Acids Res. 32. Arnaud, M.B., Chibucos, M.C., Costanzo, M.C., Crabtree, J., Inglis, D.O., Lotia, A., Orvis, J., Shah, P., Skrzypek, M.S., Binkley, G., et al. (2010). Te Aspergillus Genome Database, a curated comparative genomics resource for gene, protein and sequence information for the Aspergillus research community. Nucleic Acids Res. 38, D420–D427. Aurrecoechea, C., Barreto, A., Brestelli, J., Brunk, B.P., Cade, S., Doherty, R., Fischer, S., Gajria, B., Gao, X., Gingle, A., et al. (2013). EuPathDB: Te eukaryotic pathogen database. Nucleic Acids Res. 41, 684–691. Ballester, A., Marcet-houben, M., Levin, E., Sela, N., Selma-lázaro, C., Carmona, L., Wisniewski, M., Droby, S., González-candelas, L., and Gabaldón, T. (2015). Genome, transcriptome, and functional analyses of Penicillium expansum provide new insights into secondary metabolism and pathogenicity. Mol. Plant- Microbe Interact. 28, 232–248. Baranašić, D., Zucko, J., Diminic, J., Gacesa, R., Long, P.F., Cullum, J., Hranueli, D., and Starcevic, A. (2014). Predicting substrate specifcity of adenylation domains of nonribosomal peptide synthetases and other protein properties by latent semantic indexing. J. Ind. Microbiol. Biotechnol. 41, 461–467. Bembom, O., Keles, S., and van der Laan, M.J. (2007). Supervised detection of conserved motifs in DNA sequences with cosmo. Stat. Appl. Genet. Mol. Biol. 6, Article 8. Benndorf, D., Müller, a., Bock, K., Manuwald, O., Herbarth, O., and Von Bergen, M. (2008). Identifcation of spore allergens from the indoor mould Aspergillus versicolor. Allergy Eur. J. Allergy Clin. Immunol. 63, 454–460. Benson, G. (1999). Tandem repeats fnder: A program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580.

46 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 68 | Nybo et al.

van den Berg, M.A., Albang, R., Albermann, K., Badger, J.H., Daran, J.-M., Driessen, A.J.M., Garcia-Estrada, C., Fedorova, N.D., Harris, D.M., Heijne, W.H.M., et al. (2008). Genome sequencing and analysis of the flamentous fungus Penicillium chrysogenum. Nat. Biotechnol. 26, 1161–1168. Bergmann, S., Schümann, J., Scherlach, K., Lange, C., Brakhage, A. a, and Hertweck, C. (2007). Genomics- driven discovery of PKS–NRPS hybrid metabolites from Aspergillus nidulans. Nat. Chem. Biol. 3, 213–217. Berry, D., Cox, M.P., and Scot, B. (2015). Draf genome sequence of the flamentous fungus Penicillium paxilli (ATCC 26601). Genome Announc. 3, e00071–15. De Bie, T., Cristianini, N., Demuth, J.P., and Hahn, M.W. (2006). CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271. Bok, J.W., Hofmeister, D., Maggio-Hall, L.A., Murillo, R., Glasner, J.D., and Keller, N.P. (2006). Genomic mining for Aspergillus natural products. Chem. Biol. 13, 31–37. Braaksma, M., Martens-Uzunova, E.S., Punt, P.J., and Schaap, P.J. (2010). An inventory of the Aspergillus niger secretome by combining in silico predictions with shotgun proteomics data. BMC Genomics 11, 584. Brandl, J., and Andersen, M.R. (2015). Current state of genome-scale modeling in flamentous fungi. Biotechnol. Let. 37, 1131–1139 Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., and Batzoglou, S. (2003). LAGAN and Multi-LAGAN: Efcient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731. Caboche, S., Pupin, M., Leclère, V., Fontaine, A., Jacques, P., and Kucherov, G. (2008). NORINE: a database of nonribosomal peptides. Nucleic Acids Res. 36, D326–D331. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421. Cantarel, B.L., Coutinho, P.M., Rancurel, C., Bernard, T., Lombard, V., and Henrissat, B. (2009). Te Carbohydrate-Active EnZymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Res. 37, D233–D238. Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552. Cerqueira, G.C., Arnaud, M.B., Inglis, D.O., Skrzypek, M.S., Binkley, G., Simison, M., Miyasato, S.R., Binkley, J., Orvis, J., Shah, P., et al. (2014). Te Aspergillus Genome Database: Multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations. Nucleic Acids Res. 42, 705–710. Cheeseman, K., Ropars, J., Renault, P., Dupont, J., Gouzy, J., Branca, A., Abraham, A.-L., Ceppi, M., Conseiller, E., Debuchy, R., et al. (2014). Multiple recent horizontal transfers of a large genomic region in cheese making fungi. Nat. Commun. 5, 2876. Cherry, J.M., Hong, E.L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E.T., Christie, K.R., Costanzo, M.C., Dwight, S.S., Engel, S.R., et al. (2012). Saccharomyces Genome Database: Te genomics resource of budding yeast. Nucleic Acids Res. 40. Chiang, Y.-M., Szewczyk, E., Davidson, A.D., Keller, N., Oakley, B.R., and Wang, C.C.C. (2009). A gene cluster containing two fungal polyketide synthases encodes the biosynthetic pathway for a polyketide, asperfuranone, in Aspergillus nidulans. J. Am. Chem. Soc. 131, 2965–2970. Chiang, Y.-M., Oakley, B.R., Keller, N.P., and Wang, C.C.C. (2010). Unraveling polyketide synthesis in members of the genus Aspergillus. Appl. Microbiol. Biotechnol. 86, 1719–1736. Cimermancic, P., Medema, M.H., Claesen, J., Kurita, K., Wieland Brown, L.C., Mavrommatis, K., Pati, A., Godfrey, P.A., Koehrsen, M., Clardy, J., et al. (2014). Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell 158, 412–421. Conway, K.R., and Boddy, C.N. (2013). ClusterMine360: A database of microbial PKS/NRPS biosynthesis. Nucleic Acids Res. 41, 402–407. Cook, H., and Ussery, D.W. (2013). Sigma factors in a thousand E. coli genomes. Environ. Microbiol. 15, 3121–3129. Coutinho, P.M., Andersen, M.R., Kolenova, K., VanKuyk, P.A., Benoit, I., Gruben, B.S., Trejo-Aguilar, B., Visser, H., van Solingen, P., Pakula, T., et al. (2009). Post-genomic insights into the plant polysaccharide degradation potential of Aspergillus nidulans and comparison to Aspergillus niger and Aspergillus oryzae. Fungal Genet. Biol. 46 (Suppl. 1), S161–S169. Crabtree, J., Angiuoli, S.V., Wortman, J.R., and White, O.R. (2007). Sybil: methods and sofware for multiple genome comparison and visualization. Methods Mol. Biol. 408, 93–108. 47 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 69

Criscuolo, A. (2011). MorePhyML: Improving the phylogenetic tree space exploration with PhyML 3. Mol. Phylogenet. Evol. 61, 944–948. Dalquen, D.A., and Dessimoz, C. (2013). Bidirectional best hits miss many orthologs in duplication-rich clades such as plants and animals. Genome Biol. Evol. 5, 1800–1806. David, H., Akesson, M., and Nielsen, J. (2003). Reconstruction of the central carbon metabolism of Aspergillus niger. Eur. J. Biochem. 270, 4243–4253. David, H., Ozçelik, I.S., Hofmann, G., and Nielsen, J. (2008). Analysis of Aspergillus nidulans metabolism at the genome-scale. BMC Genomics 9, 163. Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., and Salzberg, S.L. (1999a). Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376. Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. (1999b). Improved microbial gene identifcation with GLIMMER. Nucleic Acids Res. 27, 4636–4641. Duncan, K.R., Crüsemann, M., Lechner, A., Sarkar, A., Li, J., Ziemert, N., Wang, M., Bandeira, N., Moore, B.S., Dorrestein, P.C., et al. (2015). Molecular networking and patern-based genome mining improves discovery of biosynthetic gene clusters and their products from Salinispora species. Chem. Biol. Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797. Emanuelsson, O., Brunak, S., von Heijne, G., and Nielsen, H. (2007). Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2, 953–971. Enright, A.J., Van Dongen, S., and Ouzounis, C.A. (2002). An efcient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584. Fedorova, N.D., Khaldi, N., Joardar, V.S., Maiti, R., Amedeo, P., Anderson, M.J., Crabtree, J., Silva, J.C., Badger, J.H., Albarraq, A., et al. (2008). Genomic islands in the pathogenic flamentous fungus Aspergillus fumigatus. PLoS Genet. 4, e1000046. Finn, R.D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Heger, A., Hetherington, K., [23] Holm, L., Mistry, J., et al. (2014). Pfam: Te protein families database. Nucleic Acids Res. 42. Flipphi, M., Sun, J., Robellet, X., Karafa, L., Fekete, E., Zeng, A.-P., and Kubiecek, C.P. (2009). Biodiversity and evolution of primary carbon metabolism in Aspergillus nidulans and other Aspergillus spp. Fungal Genet. Biol. 46(Suppl. 1), S19–S44. Förster, J., Famili, I., Fu, P., Palsson, B., and Nielsen, J. (2003). Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 13, 244–253. Futagami, T., Mori, K., Yamashita, A., Wada, S., Kajiwara, Y., Takashita, H., Omori, T., Takegawa, K., Tashiro, K., Kuhara, S., et al. (2011). Genome sequence of the white koji mold Aspergillus kawachii IFO 4308, used for brewing the Japanese distilled spirit shochu. Eukaryot. Cell 10, 1586–1587. Galagan, J.E., Calvo, S.E., Cuomo, C., Ma, L.-J., Wortman, J.R., Batzoglou, S., Lee, S.-I., Baştürkmen, M., Spevak, C.C., Cluterbuck, J., et al. (2005). Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature 438, 1105–1115. Gilsenan, J.M., Cooley, J., and Bowyer, P. (2012). CADRE: Te Central Aspergillus Data REpository 2012. Nucleic Acids Res. 40, 660–666. Grigoriev, I.V., Nordberg, H., Shabalov, I., Aerts, A., Cantor, M., Goodstein, D., Kuo, A., Minovitsky, S., Nikitin, R., Ohm, R.A., et al. (2012). Te genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res 40, D26–D32. Grigoriev, I.V., Nikitin, R., Haridas, S., Kuo, A., Ohm, R., Otillar, R., Riley, R., Salamov, A., Zhao, X., Korzeniewski, F., et al. (2014). MycoCosm portal: Gearing up for 1000 fungal genomes. Nucleic Acids Res. 42, 699–704. Haas, B.J. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666. Hansen, B.G., Genee, H.J., Kaas, C.S., Nielsen, J.B., Regueira, T.B., Mortensen, U.H., Frisvad, J.C., and Patil, K.R. (2011). A new class of IMP dehydrogenase with a role in self-resistance of mycophenolic acid producing fungi. BMC Microbiol. 11, 202. Heavner, B.D., Smallbone, K., Price, N.D., and Walker, L.P. (2013). Version 6 of the consensus yeast metabolic network refnes biochemical coverage and improves model performance. Database (Oxford) 2013, bat059. Horton, P., Park, K.J., Obayashi, T., Fujita, N., Harada, H., Adams-Collier, C.J., and Nakai, K. (2007). WoLF [24] PSORT: Protein localization predictor. Nucleic Acids Res. 35. Huerta-Cepas, J., Dopazo, J., and Gabaldón, T. (2010). ETE: a python Environment for Tree Exploration. BMC Bioinformatics 11, 24. 48 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 70 | Nybo et al.

Hyat, D., Chen, G.-L., Locascio, P.F., Land, M.L., Larimer, F.W., and Hauser, L.J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identifcation. BMC Bioinformatics 11, 119. Inglis, D.O., Binkley, J., Skrzypek, M.S., Arnaud, M.B., Cerqueira, G.C., Shah, P., Wymore, F., Wortman, J.R., and Sherlock, G. (2013). Comprehensive annotation of secondary metabolite biosynthetic genes and gene clusters of Aspergillus nidulans, A. fumigatus, A. niger and A. oryzae. BMC Microbiol 13, 91. Karafa, L., and Kubicek, C.P. (2003). Aspergillus niger citric acid accumulation: do we understand this well working black box? Appl. Microbiol. Biotechnol. 61, 189–196. Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066. Kent, W.J. (2002). BLAT – Te BLAST-Like Alignment Tool. Genome Res. 12, 656–664. Khaldi, N., Seifuddin, F.T., Turner, G., Haf, D., Nierman, W.C., Wolfe, K.H., and Fedorova, N.D. (2010). SMURF: Genomic mapping of fungal secondary metabolite clusters. Fungal Genet. Biol. 47, 736–741. Kis-Papo, T., Weig, A.R., Riley, R., Peršoh, D., Salamov, A., Sun, H., Lipzen, A., Wasser, S.P., Rambold, G., Grigoriev, I.V., et al. (2014). Genomic adaptations of the halophilic Dead Sea flamentous fungus Eurotium rubrum. Nat. Commun. 5, 3745. Kniemeyer, O. (2011). Proteomics of eukaryotic microorganisms: Te medically and biotechnologically important fungal genus Aspergillus. Proteomics 11, 3232–3243. Kumar, S., Tamura, K., and Nei, M. (1994). MEGA: Molecular Evolutionary Genetics Analysis sofware for microcomputers. Comput. Appl. Biosci. 10, 189–191. Land, M.L., Hyat, D., Jun, S.-R., Kora, G.H., Hauser, L.J., Lukjancenko, O., and Ussery, D.W. (2014). Quality scores for 32,000 genomes. Stand. Genomic Sci. 9, 20. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGetigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947–2948. Lassmann, T., and Sonnhammer, E.L.L. (2005). Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6, 298. Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760. [25] Li, B., Zong, Y., Du, Z., Chen, Y., Zhang, Z., Qin, G., Zhao, W., and Tian, S. (2015). Genomic characterization reveals insights into patulin biosynthesis and pathogenicity in Penicillium species. Mol. Plant Microbe Interact. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). Te Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079. Li, L., Stoeckert, C.J., and Roos, D.S. (2003). OrthoMCL: identifcation of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189. Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008). SOAP: Short oligonucleotide alignment program. Bioinformatics 24, 713–714. Lind, A.L., Wisecaver, J.H., Smith, T.D., Feng, X., Calvo, A.M., and Rokas, A. (2015). Examining the evolution of the regulatory circuit controlling secondary metabolism and development in the fungal genus Aspergillus. PLOS Genet. 11, e1005096. Linz, J.E., Wee, J., and Roze, L.V. (2014). Aspergillus parasiticus SU-1 genome sequence, predicted chromosome structure, and comparative gene expression under afatoxin-inducing conditions: evidence that diferential expression contributes to species phenotype. Eukaryot. Cell 13, 1113–1123. [26] Liu, G., Zhang, L., Wei, X., Zou, G., Qin, Y., Ma, L., Li, J., Zheng, H., Wang, S., Wang, C., et al. (2013a). Genomic and secretomic analyses reveal unique features of the lignocellulolytic enzyme system of Penicillium decumbens. PLoS One 8. Liu, J., Gao, Q., Xu, N., and Liu, L. (2013b). Genome-scale reconstruction and in silico analysis of Aspergillus terreus metabolism. Mol. Biosyst. 9, 1939–1948. Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P.M., and Henrissat, B. (2014). Te carbohydrate- active enzymes database (CAZy) in 2013. Nucleic Acids Res. 42, 490–495. Machida, M., Asai, K., Sano, M., Tanaka, T., Kumagai, T., Terai, G., Kusumoto, K.-I., Arima, T., Akita, O., Kashiwagi, Y., et al. (2005). Genome sequencing and analysis of Aspergillus oryzae. Nature 438, 1157–1161. Mancheron, A., Uricaru, R., and Rivals, E. (2011). An alternative approach to multiple genome comparison. Nucleic Acids Res. 1–11. Marcet-Houben, M., Ballester, A.-R., de la Fuente, B., Harries, E., Marcos, J.F., González-Candelas, L., and Gabaldón, T. (2012). Genome sequence of the necrotrophic fungus Penicillium digitatum, the main postharvest pathogen of citrus. BMC Genomics 13, 646. 49 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 71

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al. (2010). Te genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303. Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergrif, J., Rabkin, S., Guo, N., Muruganujan, A., [27] Doremieux, O., Campbell, M.J., et al. (2005). Te PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 33. Mitsuguchi, H., Seshime, Y., Fujii, I., Shibuya, M., Ebizuka, Y., and Kushiro, T. (2009). Biosynthesis of steroidal antibiotic fusidanes: Functional analysis of oxidosqualene cyclase and subsequent tailoring enzymes from Aspergillus fumigatus. J. Am. Chem. Soc. 131, 6402–6411. Monk, J.M., Charusanti, P., Aziz, R.K., Lerman, J.A., Premyodhin, N., Orth, J.D., Feist, A.M., and Palsson, B.Ø. (2013). Genome-scale metabolic reconstructions of multiple Escherichia coli strains highlight strain-specifc adaptations to nutritional environments. Proc. Natl. Acad. Sci. U.S.A. 110, 20338–20343. Nicholson, M.J., Koulman, A., Monahan, B.J., Pritchard, B.L., Payne, G.A., and Scot, B. (2009). Identifcation of two afatrem biosynthesis gene loci in Aspergillus favus and metabolic engineering of Penicillium paxilli to elucidate their function. Appl. Environ. Microbiol. 75, 7469–7481. Nierman, W.C., Pain, A., Anderson, M.J., Wortman, J.R., Kim, H.S., Arroyo, J., Berriman, M., Abe, K., Archer, D.B., Bermejo, C., et al. (2005). Genomic sequence of the pathogenic and allergenic flamentous fungus Aspergillus fumigatus. Nature 438, 1151–1156. Nierman, W.C., Yu, J., Fedorova-Abrams, N.D., Losada, L., Cleveland, T.E., Bhatnagar, D., Bennet, J.W., Dean, R., and Payne, G.A. (2015a). Genome sequence of Aspergillus favus NRRL 3357, a strain that causes afatoxin contamination of food and feed. Genome Announc. 3, e00168–15. Nierman, W.C., Fedorova-Abrams, N.D., and Andrianopoulos, A. (2015b). Genome sequence of the AIDS- associated pathogen Penicillium marnefei (ATCC18224) and its near taxonomic relative Talaromyces stipitatus (ATCC10500). Genome Announc. 3, e01559–14. O’Brien, K.P., Remm, M., and Sonnhammer, E.L.L. (2005). Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 33, D476–D480. Okabe, M., Lies, D., Kanamasa, S., and Park, E.Y. (2009). Biotechnological production of itaconic acid and its biosynthesis in Aspergillus terreus. Appl. Microbiol. Biotechnol. 84, 597–606. Orth, J.D., Conrad, T.M., Na, J., Lerman, J.A., Nam, H., Feist, A.M., and Palsson, B.Ø. (2011). A comprehensive genome-scale reconstruction of Escherichia coli Metabolism. Mol. Syst. Biol. 7, 535. Osbourn, A. (2010). Secondary metabolic gene clusters: evolutionary toolkits for chemical innovation. Trends Genet. 26, 449–457. Ozturkoglu Budak, S., Zhou, M., Brouwer, C., Wiebenga, A., Benoit, I., Di Falco, M., Tsang, A., and de Vries, R.P. (2014). A genomic survey of proteases in aspergilli. BMC Genomics 15, 523. Park, K. (2009). Development of an analysis program of type I polyketide synthase gene clusters using homology search and profle hidden Markov model. J. Microbiol. Biotechnol. 19, 140–146. Van De Peer, Y., and De Wachter, R. (1994). Treecon for windows: A sofware package for the construction and drawing of evolutionary trees for the microsof windows environment. Bioinformatics 10, 569–570. Pel, H.J., de Winde, J.H., Archer, D.B., Dyer, P.S., Hofmann, G., Schaap, P.J., Turner, G., de Vries, R.P., Albang, R., Albermann, K., et al. (2007). Genome sequencing and analysis of the versatile cell factory Aspergillus niger CBS 513.88. Nat. Biotechnol. 25, 221–231. Pi, B., Yu, D., Dai, F., Song, X., Zhu, C., Li, H., and Yu, Y. (2015). A genomics based discovery of secondary metabolite biosynthetic gene clusters in Aspergillus ustus. PLoS One 10, e0116089. Pitkänen, E., Jouhten, P., Hou, J., Syed, M.F., Blomberg, P., Kludas, J., Oja, M., Holm, L., Pentilä, M., Rousu, J., et al. (2014). Comparative genome-scale reconstruction of gapless metabolic networks for present and ancestral species. PLoS Comput. Biol. 10, e1003465. Price, A.L., Jones, N.C., and Pevzner, P.A. (2005). De novo identifcation of repeat families in large genomes. [28] Bioinformatics 21. Rambaut, A. (2009). FigTree, a graphical viewer of phylogenetic trees. Inst. Evol. Biol. Univ. Edinburgh. [29] Reid, I., O’Toole, N., Zabaneh, O., Nourzadeh, R., Dahdouli, M., Abdellateef, M., Gordon, P.M.K., Soh, J., Butler, G., Sensen, C.W., et al. (2014). SnowyOwl: accurate prediction of fungal genes by using RNA- Seq and homology information to select among ab initio models. BMC Bioinformatics 15, 229. Retief, J.D. (2000). Phylogenetic analysis using PHYLIP. Methods Mol. Biol. 132, 243–258. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: the European Molecular Biology Open Sofware Suite. Trends Genet. 16, 276–277. Riley, D.R., Angiuoli, S.V., Crabtree, J., Dunning Hotopp, J.C., and Tetelin, H. (2012). Using Sybil for interactive comparative genomics of microbes on the web. Bioinformatics 28, 160–166. 50 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P 72 | Nybo et al.

Rokas, A., Payne, G., Fedorova, N.D., Baker, S.E., Machida, M., Yu, J., Georgianna, D.R., Dean, R.A., Bhatnagar, D., Cleveland, T.E., et al. (2007). What can comparative genomics tell us about species concepts in the genus Aspergillus? Stud. Mycol. 59, 11–17. Rötig, M., Medema, M.H., Blin, K., Weber, T., Rausch, C., and Kohlbacher, O. (2011). NRPSpredictor2 – A web server for predicting NRPS adenylation domain specifcity. Nucleic Acids Res. 39. Salazar, M., Vongsangnak, W., Panagiotou, G., Andersen, M., and Nielsen, J. (2009). Uncovering transcriptional regulation of glycerol metabolism in Aspergilli through genome-wide gene expression data analysis. Mol. Genet. Genomics 282, 571–586. Sanchez, J.F., Somoza, A.D., Keller, N.P., and Wang, C.C.C. (2012). Advances in Aspergillus secondary metabolite research in the post-genomic era. Nat. Prod. Rep. 29, 351–371. Schnoes, A.M., Ream, D.C., Torman, A.W., Babbit, P.C., and Friedberg, I. (2013). Biases in the experimental annotations of protein function and their efect on our understanding of protein function space. PLoS Comput. Biol. 9, e1003063. Snipen, L., and Ussery, D.W. (2010). Standard operating procedure for computing pangenome trees. Stand. Genomic Sci. 2, 135–141. Stajich, J.E., Harris, T., Brunk, B.P., Brestelli, J., Fischer, S., Harb, O.S., Kissinger, J.C., Li, W., Nayak, V., Pinney, D.F., et al. (2012). FungiDB: An integrated functional genomics database for fungi. Nucleic Acids Res. 40, 675–681. Stamatakis, A. (2014). RxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313. Stoll, D. a., Link, S., Kulling, S., Geisen, R., and Schmidt-Heydt, M. (2014). Comparative proteome analysis of Penicillium verrucosum grown under light of short wavelength shows an induction of stress-related proteins associated with modifed mycotoxin biosynthesis. Int. J. Food Microbiol. 175, 20–29. [30] Strope, P.K., Skelly, D.A., Kozmin, S.G., Mahadevan, G., Stone, E.A., Magwene, P.M., Dietrich, F.S., and McCusker, J.H. (2015). Te 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen. Genome Res. gr.185538.114 –. [31] Subramanian, A.R., Hiran, S., Steinkamp, R., Meinicke, P., Corel, E., and Morgenstern, B. (2010). DIALIGN-TX and multiple protein alignment using secondary structure information at GOBICS. Nucleic Acids Res. 38. Szewczyk, E., Chiang, Y.-M., Oakley, C.E., Davidson, A.D., Wang, C.C.C., and Oakley, B.R. (2008). Identifcation and characterization of the asperthecin gene cluster of Aspergillus nidulans. Appl. Environ. Microbiol. 74, 7607–7612. Terabayashi, Y., Shimizu, M., Kitazume, T., Masuo, S., Fujii, T., and Takaya, N. (2012). Conserved and specifc responses to hypoxia in Aspergillus oryzae and Aspergillus nidulans determined by comparative transcriptomics. Appl. Microbiol. Biotechnol. 93, 305–317. [32] Tomas-Chollier, M., Sand, O., Turatsinze, J.V., Janky, R., Defrance, M., Vervisch, E., Brohée, S., and van Helden, J. (2008). RSAT: regulatory sequence analysis tools. Nucleic Acids Res. 36. [33] Umemura, M., Koike, H., Nagano, N., Ishii, T., Kawano, J., Yamane, N., Kozone, I., Horimoto, K., Shin-ya, K., Asai, K., et al. (2013). MIDDAS-M: Motif-independent de novo detection of secondary metabolite gene clusters through the integration of genome sequencing and transcriptome data. PLoS One 8. Vongsangnak, W., Olsen, P., Hansen, K., Krogsgaard, S., and Nielsen, J. (2008). Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae. BMC Genomics 9, 245. Wallace, I.M., O’Sullivan, O., Higgins, D.G., and Notredame, C. (2006). M-Cofee: combining multiple sequence alignment methods with T-Cofee. Nucleic Acids Res. 34, 1692–1699. Wang, F.-Q., Zhong, J., Zhao, Y., Xiao, J., Liu, J., Dai, M., Zheng, G., Zhang, L., Yu, J., Wu, J., et al. (2014). Genome sequencing of high-penicillin producing industrial strain of Penicillium chrysogenum. BMC Genomics 15 (Suppl. 1), S11. Weber, T. (2014). In silico tools for the analysis of antibiotic biosynthetic pathways. Int. J. Med. Microbiol. 304, 230–235. Weber, T., Blin, K., Duddela, S., Krug, D., Kim, H.U., Bruccoleri, R., Lee, S.Y., Fischbach, M. a., Muller, R., Wohlleben, W., et al. (2015). antiSMASH 3.0 – a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res. 1–7. [34] Wheeler, D.L., Chappey, C., Lash, A.E., Leipe, D.D., Madden, T.L., Schuler, G.D., Tatusova, T. a, and Rapp, B. a (2013). Database resources of the National Center for Biotechnology Information\. 28, 10–14. Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Prüss, M., Reuter, I., and Schacherer, F. (2000). TRNSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28, 316–319. 51 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Comparative Genomics in Aspergillus and Penicillium | 73

Wolf, Y.I., and Koonin, E.V. (2012). A tight link between orthologs and bidirectional best hits in bacterial and archaeal genomes. Genome Biol. Evol. 4, 1286–1294. Wood, D.E., Lin, H., Levy-Moonshine, A., Swaminathan, R., Chang, Y.-C., Anton, B.P., Osmani, L., Stefen, M., Kasif, S., and Salzberg, S.L. (2012). Tousands of missed genes found in bacterial genomes and their analysis with COMBREX. Biol. Direct 7, 37. Yang, Y., Zhao, H., Barrero, R. a, Zhang, B., Sun, G., Wilson, I.W., Xie, F., Walker, K.D., Parks, J.W., Bruce, R., et al. (2014). Genome sequencing and analysis of the paclitaxel-producing endophytic fungus Penicillium aurantiogriseum NRRL 62431. BMC Genomics 15, 69. Yu, J., Jurick, W.M., Cao, H., Yin, Y., Gaskins, V.L., Losada, L., Zafar, N., Kim, M., Bennet, J.W., and Nierman, W.C. (2014). Draf genome sequence of Penicillium expansum strain R19, which causes postharvest decay of apple fruit. Genome Announc. 2, 2013–2014.

52 UNCORRECTED PROOF Date: 16:49 Tuesday 23 February 2016 File: Aspergillus-Penicillium 2P Bibliography

M. R. Andersen, M. P. Salazar, P. J. Schaap, P. J. Van De Vondervoort, D. Culley, J. Thykaer, J. C. Frisvad, K. F. Nielsen, R. Albang, K. Al- bermann, R. M. Berka, G. H. Braus, S. A. Braus-Stromeyer, L. M. Cor- rochano, Z. Dai, P. W. Van Dijck, G. Hofmann, L. L. Lasure, J. K. Magnu- son, H. Menke, M. Meijer, S. L. Meijer, J. B. Nielsen, M. L. Nielsen, A. J. Van Ooyen, H. J. Pel, L. Poulsen, R. A. Samson, H. Stam, A. Tsang, J. M. Van Den Brink, A. Atkins, A. Aerts, H. Shapiro, J. Pangilinan, A. Salamov, Y. Lou, E. Lindquist, S. Lucas, J. Grimwood, I. V. Grigoriev, C. P. Ku- bicek, D. Martinez, N. N. Van Peij, J. A. Roubos, J. Nielsen, and S. E. Baker. Comparative genomics of citric-acid-producing Aspergillus niger ATCC 1015 versus enzyme-producing CBS 513.88. Genome Research, 21 (6):885–897, 2011. ISSN 10889051. doi: 10.1101/gr.112169.110.

T. Awakawa, X. L. Yang, T. Wakimoto, and I. Abe. Pyranonigrin E: A PKS-NRPS hybrid metabolite from aspergillus niger identified by genome mining. ChemBioChem, 14(16):2095–2099, 2013. ISSN 14394227. doi: 10.1002/cbic.201300430.

G. F. Bills, Q. Yue, L. Chen, Y. Li, Z. An, and J. C. Frisvad. Aspergillus mulundensis sp. nov., a new species for the fungus producing the antifungal echinocandin lipopeptides, mulundocandins. Journal of Antibiotics, 69(3): 141–148, 2016. ISSN 18811469. doi: 10.1038/ja.2015.105. URL http: //dx.doi.org/10.1038/ja.2015.105.

D. Boettger and C. Hertweck. Molecular Diversity Sculpted by Fungal PKS- NRPS Hybrids. ChemBioChem, 14(1):28–42, 2013. ISSN 14394227. doi: 10.1002/cbic.201200624.

H. U. B¨ohnert, I. Fudal, W. Dioh, D. Tharreau, J.-L. Notteghem, and M.-H. Lebrun. A putative polyketide synthase/peptide syn- thetase from Magnaporthe grisea signals pathogen attack to resistant rice. The Plant cell, 16(9):2499–513, 2004. ISSN 1040-4651. doi: 10.1105/tpc.104.022715. URL http://www.ncbi.nlm.nih.gov/pubmed/

53 15319478$\delimiter"026E30F$nhttp://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=PMC520948.

K. E. Bushley and B. G. Turgeon. Phylogenomics reveals subfamilies of fungal nonribosomal peptide synthetases and their evolutionary relation- ships. BMC evolutionary biology, 10:26, 2010. ISSN 1471-2148. doi: 10.1186/1471-2148-10-26.

C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L. Madden. BLAST+: architecture and applications. BMC Bioin- formatics, 10(1):421, 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10- 421. URL http://www.biomedcentral.com/1471-2105/10/421.

A. Chen, J. Frisvad, B. Sun, J. Varga, S. Kocsub´e,J. Dijksterhuis, D. Kim, S.-B. Hong, J. Houbraken, and R. Samson. Aspergillus section Nidu- lantes (formerly Emericella): Polyphasic taxonomy, chemistry and bi- ology. Studies in Mycology, pages 1–118, 2016. ISSN 01660616. doi: 10.1016/j.simyco.2016.10.001.

S. L. Clugston, S. A. Sieber, M. A. Marahiel, and C. T. Walsh. Chirality of peptide bond-forming condensation domains in nonribosomal peptide synthetases: the C5 domain of tyrocidine synthetase is a (D)C(L) catalyst. Biochemistry, 42(41):12095–104, oct 2003. ISSN 0006-2960. doi: 10.1021/ bi035090+. URL http://www.ncbi.nlm.nih.gov/pubmed/14556641.

R. A. Coulombe and R. P. Sharma. Clearance and excretion of intratra- cheally and orally administered aflatoxin B1 in the rat. Food and Chem- ical Toxicology, 23(9):827–830, 1985. ISSN 02786915. doi: 10.1016/0278- 6915(85)90283-2.

R. P. de Vries, J. Visser, P. Ronald, de Vries, R., and P. Aspergillus Enzymes Involved in Degradation of Plant Cell Wall Polysaccharides. Microbiology and Molecular Biology Reviews, 65(4):497–522, 2001. ISSN 1092-2172. doi: 10.1128/MMBR.65.4.497.

P. M. Dewick. Medicinal Natural Products. John Wiley & Sons, Ltd, Chich- ester, UK, feb 2009. ISBN 9780470742761. doi: 10.1002/9780470742761. URL http://doi.wiley.com/10.1002/9780470742761.

B. Diez, V. Ii, J. F. Martin, and J. L. Barredosll. The Cluster of Penicillin Biosynthetic Genes. Biochemistry, 265(27):16358–16365, 1990.

B. Dujon, D. Sherman, G. Fischer, P. Durrens, S. Casaregela, I. Lafentaine, J. De Montigny, C. Marck, C. Neuv´eglise,E. Talla, N. Goffard, L. Frangeul,

54 M. Algie, V. Anthouard, A. Babour, V. Barbe, S. Barnay, S. Blanchin, J. M. Beckerich, E. Beyne, C. Bleykasten, A. Boisram´e,J. Boyer, L. Cat- tolico, F. Confanioleri, A. De Daruvar, L. Despons, E. Fabre, C. Fairhead, H. Ferry-Dumazet, A. Groppi, F. Hantraye, C. Hennequin, N. Jauniaux, P. Joyot, R. Kachouri, A. Kerrest, R. Koszul, M. Lemaire, I. Lesur, L. Ma, H. Muller, J. M. Nicaud, M. Nikolski, S. Oztas, O. Ozier-Kalogeropoulos, S. Pellenz, S. Potter, G. F. Richard, M. L. Straub, A. Suleau, D. Swennen, F. Tekai, M. W´esolowski-Louvel, E. Westhof, B. Wirth, M. Zeniou-Meyer, I. Zivanovic, M. Bolotin-Fukuhara, A. Thierry, C. Bouchter, B. Cau- dron, C. Scarpelli, C. Gaillardin, J. Weissebach, P. Wincker, and J. L. Souciet. Genome evolution in yeasts. Nature, 430(6995):35–44, 2004. ISSN 00280836. doi: 10.1038/nature02579.

M. Eisendle, H. Oberegger, I. Zadra, and H. Haas. The siderophore system is essential for viability of Aspergillus nidulans: Functional analysis of two genes encoding L-ornithine N5-monooxygenase (sidA) and a non-ribosomal peptide synthetase (sidC). Molecular Microbiology, 49(2):359–375, 2003. ISSN 0950382X. doi: 10.1046/j.1365-2958.2003.03586.x.

B. S. E. Evans. Nonribosomal Peptide and Polyketide Biosynthesis. 2016. ISBN 9781493933730. doi: 10.1007/978-1-4939-3375-4.

R. Finking and M. a. Marahiel. Biosynthesis of nonribosomal pep- tides. Annual review of microbiology, 58:453–88, jan 2004. ISSN 0066- 4227. doi: 10.1146/annurev.micro.58.030603.123615. URL http:// www.ncbi.nlm.nih.gov/pubmed/15487945.

M. A. Fischbach, C. T. Walsh, and J. Clardy. The evolution of gene col- lectives: How natural selection drives chemical innovation. Pnas, 105(12): 4601–4608, 2008. ISSN 0027-8424. doi: 10.1073/pnas.0709132105.

J. C. Frisvad and T. O. Larsen. Chemodiversity in the genus Aspergillus. Applied Microbiology and Biotechnology, 99(19):7859–7877, 2015. ISSN 14320614. doi: 10.1007/s00253-015-6839-z.

J. E. Galagan, S. E. Calvo, C. Cuomo, L.-J. Ma, J. R. Wortman, S. Batzoglou, S.-I. Lee, M. Ba¸st¨urkmen,C. C. Spevak, J. Clutterbuck, V. Kapitonov, J. Jurka, C. Scazzocchio, M. Farman, J. Butler, S. Pur- cell, S. Harris, G. H. Braus, O. Draht, S. Busch, C. D’Enfert, C. Bouch- ier, G. H. Goldman, D. Bell-Pedersen, S. Griffiths-Jones, J. H. Doonan, J. Yu, K. Vienken, A. Pain, M. Freitag, E. U. Selker, D. B. Archer, M. A.´ Pe˜nalva, B. R. Oakley, M. Momany, T. Tanaka, T. Kumagai, K. Asai, M. Machida, W. C. Nierman, D. W. Denning, M. Caddick, M. Hynes,

55 M. Paoletti, R. Fischer, B. Miller, P. Dyer, M. S. Sachs, S. A. Os- mani, and B. W. Birren. Sequencing of Aspergillus nidulans and com- parative analysis with A. fumigatus and A. oryzae. Nature, 438(7071): 1105–1115, 2005. ISSN 0028-0836. doi: 10.1038/nature04341. URL http://www.nature.com/doifinder/10.1038/nature04341.

A. Gallo, M. Ferrara, and G. Perrone. Phylogenetic study of polyketide synthases and nonribosomal peptide synthetases involved in the biosyn- thesis of mycotoxins. Toxins, 5(4):717–742, 2013. ISSN 20726651. doi: 10.3390/toxins5040717.

K. Hagimori, T. Fukuda, Y. Hasegawa, S. Omura, and H. Tomoda. Fungal malformins inhibit bleomycin-induced G2 checkpoint in Jurkat cells. Bio- logical & pharmaceutical bulletin, 30(8):1379–83, 2007a. ISSN 0918-6158. doi: 10.1248/bpb.30.1379.

K. Hagimori, T. Fukuda, Y. Hasegawa, S. Omura, and H. Tomoda. Fungal malformins inhibit bleomycin-induced G2 checkpoint in Jurkat cells. Bio- logical & pharmaceutical bulletin, 30(8):1379–83, 2007b. ISSN 0918-6158. doi: 10.1248/bpb.30.1379.

L. H. Hartwell and M. B. Kastan. Cell cycle control and cancer. Science (New York, N.Y.), 266(5192):1821–8, dec 1994. ISSN 0036-8075. URL http://www.ncbi.nlm.nih.gov/pubmed/7997877.

G. Holt and K. D. MacDonald. Isolation of strains with increased penicillin yield after hybridization in Aspergillus nidulans. Nature, 219(5154):636– 637, 1968. ISSN 00280836. doi: 10.1038/219636a0.

V. Hubka, A. Nov´akov´a,S. W. Peterson, J. C. Frisvad, F. Sklen´aˇr,T. Mat- suzawa, A. Kub´atov´a,and M. Kolaˇr´ık. A reappraisal of Aspergillus sec- tion Nidulantes with descriptions of two new sterigmatocystin-producing species. Plant Systematics and Evolution, 302(9):1267–1299, 2016. ISSN 16156110. doi: 10.1007/s00606-016-1331-5.

N. Khaldi and K. H. Wolfe. Evolutionary Origins of the Fumonisin Sec- ondary Metabolite Gene Cluster in Fusarium verticillioides and Aspergillus niger. International Journal of Evolutionary Biology, 2011:1–7, 2011. ISSN 2090-052X. doi: 10.4061/2011/423821. URL http://www.hindawi.com/ journals/ijeb/2011/423821/.

N. Khaldi, J. Collemare, M.-H. Lebrun, and K. H. Wolfe. Evidence for horizontal transfer of a secondary metabolite gene cluster between fungi.

56 Genome biology, 9(1):R18, 2008. ISSN 1465-6906. doi: 10.1186/gb-2008- 9-1-r18.

M. A. Klich. Aspergillus flavus: the major producer of aflatoxin. Molecular Plant Pathology, 8(6):713–722, nov 2007. ISSN 1464-6722. doi: 10.1111/ j.1364-3703.2007.00436.x. URL http://www.ncbi.nlm.nih.gov/pubmed/ 20507532http://doi.wiley.com/10.1111/j.1364-3703.2007.00436.x.

A. Klitgaard, J. B. Nielsen, R. J. N. Frandsen, M. R. Andersen, and K. F. Nielsen. Combining Stable Isotope Labeling and Molecular Networking for Biosynthetic Pathway Characterization. Analytical Chemistry, 87(13): 6520–6526, jul 2015. ISSN 0003-2700. doi: 10.1021/acs.analchem.5b01934. URL http://pubs.acs.org/doi/abs/10.1021/acs.analchem.5b01934.

G. Koczyk, A. Dawidziuk, and D. Popiel. The distant siblings - A phy- logenomic roadmap illuminates the origins of extant diversity in fungal aromatic polyketide biosynthesis. Genome Biology and Evolution, 7(11): 3132–3154, 2015. ISSN 17596653. doi: 10.1093/gbe/evv204.

S. Kroken, N. L. Glass, J. W. Taylor, O. C. Yoder, and B. G. Turgeon. Phy- logenomic analysis of type I polyketide synthase genes in pathogenic and saprobic ascomycetes. Proceedings of the National Academy of Sciences of the United States of America, 100(26):15670–5, 2003. ISSN 0027-8424. doi: 10.1073/pnas.2532165100. URL http://www.pnas.org/content/100/26/ 15670.full.

Y. Lamboni, K. F. Nielsen, A. R. Linnemann, Y. K. Gezgin, K. Hell, M. J. R. Nout, E. J. Smid, M. Tamo, M. A. J. S. Van Boekel, J. B. Hoof, and J. C. Frisvad. Diversity in secondary metabolites including mycotoxins from strains of aspergillus section nigri isolated from raw cashew nuts from benin, west africa. PLoS ONE, 11(10):1–14, 2016. ISSN 19326203. doi: 10.1371/journal.pone.0164310.

D. P. Lawrence, S. Kroken, B. M. Pryor, and A. E. Arnold. Interk- ingdom gene transfer of a hybrid NPS/PKS from bacteria to filamen- tous Ascomycota. PloS one, 6(11):e28231, jan 2011. ISSN 1932-6203. doi: 10.1371/journal.pone.0028231. URL http://journals.plos.org/ plosone/article?id=10.1371/journal.pone.0028231.

A. P. MacCabe, H. V. Liemptll, H. Palissall, S. E. Unkless, and E. Pfeiferll. &(L-cu-Aminoadipyl)-L-cysteinyl-D-valine Synthetase from Aspergillus nidulans. The Journal of Biological Chemistry, 266(19):12646–12654, 1991.

57 M. Machida, K. Asai, M. Sano, T. Tanaka, T. Kumagai, G. Terai, K. I. Kusumoto, T. Arima, O. Akita, Y. Kashiwagi, K. Abe, K. Gomi, H. Hori- uchi, K. Kitamoto, T. Kobayashi, M. Takeuchi, D. W. Denning, J. E. Gala- gan, W. C. Nierman, J. Yu, D. B. Archer, J. W. Bennett, D. Bhatnagar, T. E. Cleveland, N. D. Fedorova, O. Gotoh, H. Horikawa, A. Hosoyama, M. Ichinomiya, R. Igarashi, K. Iwashita, P. R. Juvvadi, M. Kato, Y. Kato, T. Kin, A. Kokubun, H. Maeda, N. Maeyama, J. I. Maruyama, H. Na- gasaki, T. Nakajima, K. Oda, K. Okada, I. Paulsen, K. Sakamoto, T. Sawano, M. Takahashi, K. Takase, Y. Terabayashi, J. R. Wortman, O. Yamada, Y. Yamagata, H. Anazawa, Y. Hata, Y. Koide, T. Komori, Y. Koyama, T. Minetoki, S. Suharnan, A. Tanaka, K. Isono, S. Kuhara, N. Ogasawara, and H. Kikuchi. Genome sequencing and analysis of As- pergillus oryzae. Nature, 438(7071):1157–1161, 2005. ISSN 14764687. doi: 10.1038/nature04300.

M. A. Marahiel. A structural model for multimodular NRPS assembly lines. Nat. Prod. Rep., 33(2):136–140, 2016. ISSN 0265-0568. doi: 10.1039/ C5NP00082C. URL http://xlink.rsc.org/?DOI=C5NP00082C.

Y. Miyake, C. Ito, M. Itoigawa, and T. Osawa. Isolation of the Antioxidant Pyranonigrin-A from Rice Mold Starters Used in the Manufacturing Pro- cess of Fermented Foods. Bioscience, Biotechnology, and Biochemistry, 71 (10):2515–2521, oct 2007. ISSN 0916-8451. doi: 10.1271/bbb.70310. URL http://www.tandfonline.com/doi/full/10.1271/bbb.70310.

D. Moore, G. Robson, and T. Trinci. 21st Century Guidebook to Fungi. Cambridge University Press, Cambridge, 2011. ISBN 9780511977022. doi: 10.1017/CBO9780511977022. URL http://ebooks.cambridge.org/ref/ id/CBO9780511977022.

H. D. Mootz, D. Schwarzer, and M. A. Marahiel. Ways of assembling complex natural products on modular nonribosomal peptide synthetases. ChemBioChem, 3(6):490–504, 2002. ISSN 14394227. doi: 10.1002/1439- 7633(20020603)3:6h490::AID-CBIC490i3.0.CO;2-N.

W. C. Nierman, A. Pain, M. J. Anderson, J. R. Wortman, H. S. Kim, J. Arroyo, M. Berriman, K. Abe, D. B. Archer, C. Bermejo, J. Bennett, P. Bowyer, D. Chen, M. Collins, R. Coulsen, R. Davies, P. S. Dyer, M. Far- man, N. Fedorova, N. Fedorova, T. V. Feldblyum, R. Fischer, N. Fosker, A. Fraser, J. L. Garc´ıa,M. J. Garc´ıa,A. Goble, G. H. Goldman, K. Gomi, S. Griffith-Jones, R. Gwilliam, B. Haas, H. Haas, D. Harris, H. Horiuchi, J. Huang, S. Humphray, J. Jim´enez,N. Keller, H. Khouri, K. Kitamoto,

58 T. Kobayashi, S. Konzack, R. Kulkarni, T. Kumagai, A. Lafton, J. P. Latg´e,W. Li, A. Lord, C. Lu, W. H. Majoros, G. S. May, B. L. Miller, Y. Mohamoud, M. Molina, M. Monod, I. Mouyna, S. Mulligan, L. Mur- phy, S. O’Neil, I. Paulsen, M. A. Pe˜nalva, M. Pertea, C. Price, B. L. Pritchard, M. A. Quail, E. Rabbinowitsch, N. Rawlins, M. A. Rajan- dream, U. Reichard, H. Renauld, G. D. Robson, S. Rodriguez De Cor- doba, J. M. Rodr´ıguez-Pe˜na,C. M. Ronning, S. Rutter, S. L. Salzberg, M. Sanchez, J. C. S´anchez-Ferrero, D. Saunders, K. Seeger, R. Squares, S. Squares, M. Takeuchi, F. Tekaia, G. Turner, C. R. Vazquez De Al- dana, J. Weidman, O. White, J. Woodward, J. H. Yu, C. Fraser, J. E. Galagan, K. Asai, M. Machida, N. Hall, B. Barrell, and D. W. Denning. Genomic sequence of the pathogenic and allergenic filamentous fungus As- pergillus fumigatus. Nature, 438(7071):1151–1156, 2005. ISSN 14764687. doi: 10.1038/nature04332.

H. J. Pel, J. H. de Winde, D. B. Archer, P. S. Dyer, G. Hofmann, P. J. Schaap, G. Turner, R. P. de Vries, R. Albang, K. Albermann, M. R. Andersen, J. D. Bendtsen, J. a. E. Benen, M. van den Berg, S. Breestraat, M. X. Caddick, R. Contreras, M. Cornell, P. M. Coutinho, E. G. J. Danchin, A. J. M. Debets, P. Dekker, P. W. M. van Dijck, A. van Dijk, L. Dijkhuizen, A. J. M. Driessen, C. D’Enfert, S. Geysens, C. Goosen, G. S. P. Groot, P. W. J. de Groot, T. Guillemette, B. Henrissat, M. Herweijer, J. P. T. W. van den Hombergh, C. a. M. J. J. van den Hondel, R. T. J. M. van der Heijden, R. M. van der Kaaij, F. M. Klis, H. J. Kools, C. P. Kubicek, P. a. van Kuyk, J. Lauber, X. Lu, M. J. E. C. van der Maarel, R. Meulenberg, H. Menke, M. a. Mortimer, J. Nielsen, S. G. Oliver, M. Olsthoorn, K. Pal, N. N. M. E. van Peij, A. F. J. Ram, U. Rinas, J. a. Roubos, C. M. J. Sagt, M. Schmoll, J. Sun, D. Ussery, J. Varga, W. Vervecken, P. J. J. van de Vondervoort, H. Wedler, H. a. B. W¨osten,A.-P. Zeng, A. J. J. van Ooyen, J. Visser, and H. Stam. Genome sequencing and analysis of the versatile cell factory Aspergillus niger CBS 513.88. Nature biotechnology, 25(2):221–31, feb 2007. ISSN 1087-0156. doi: 10.1038/nbt1282. URL http://www.ncbi.nlm.nih.gov/pubmed/17259976.

L. M. Petersen, T. T. Bladt, C. D¨urr,M. Seiffert, J. C. Frisvad, C. H. Got- fredsen, and T. O. Larsen. Isolation, structural analyses and biological activity assays against chronic lymphocytic leukemia of two novel cytocha- lasins - Sclerotionigrin A and B. Molecules, 19(7):9786–9797, 2014a. ISSN 14203049. doi: 10.3390/molecules19079786.

L. M. Petersen, C. Hoeck, J. C. Frisvad, C. H. Gotfredsen, and T. O. Larsen. Dereplication guided discovery of secondary metabolites of mixed biosyn-

59 thetic origin from Aspergillus aculeatus. Molecules, 19(8):10898–10921, 2014b. ISSN 14203049. doi: 10.3390/molecules190810898.

P. Pons and M. Latapy. Computing communities in large networks using random walks. Physics and Society, page arXiv:physics/0512106, 2005. ISSN 07403194. doi: 10.1007/11569596. URL http://arxiv.org/abs/ physics/0512106.

I. F. Purchase and J. J. Van der Watt. Carcinogenicity of sterigmatocystin to rat skin. Toxicology and Applied Pharmacology, 26(2):274–281, 1973. ISSN 10960333. doi: 10.1016/0041-008X(73)90262-7.

K. Qiao, Y. H. Chooi, and Y. Tang. Identification and engineering of the cytochalasin gene cluster from Aspergillus clavatus NRRL 1. Metabolic Engineering, 13(6):723–732, 2011. ISSN 10967176. doi: 10.1016/ j.ymben.2011.09.008.

H. Rafiei, P. Dehghan, K. Pakshir, M. C. Pour, and M. Ak- bari. The concentration of aflatoxin M1 in the mothers’ milk in Khorrambid City, Fars, Iran. Advanced biomedical research, 3:152, 2014. ISSN 2277-9175. doi: 10.4103/2277-9175.137859. URL http://www.ncbi.nlm.nih.gov/pubmed/25221755http:// www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4162072.

C. Rausch, T. Weber, O. Kohlbacher, W. Wohlleben, and D. H. Huson. Specificity prediction of adenylation domains in nonribosomal peptide syn- thetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Research, 33(18):5799–5808, 2005. ISSN 03051048. doi: 10.1093/nar/gki885.

C. Rausch, I. Hoof, T. Weber, W. Wohlleben, and D. H. Huson. Phylogenetic analysis of condensation domains in NRPS sheds light on their functional evolution. BMC evolutionary biology, 7:78, 2007. ISSN 14712148. doi: 10.1186/1471-2148-7-78.

A. Rubinstein, Y. Lurie, L. Groskop, and M. Weintrob. Cholesterol-Lowering Eeffects of a 10 mg Daily Dose of Lovastatin in Patients with Initial Total Cholesterol Levels 200 to 240 mg/dl (5.18 to 6.21 mmol/liter). American Journal of Cardiology, 68(11):1123–1126, 1991.

W. G. Sorenson, J. D. Tucker, and J. P. Simpson. Mutagenicity of the tetramic mycotoxin cyclopiazonic acid. Applied and Environmental Micro- biology, 47(6):1355–1357, 1984. ISSN 00992240.

60 T. Stachelhaus, H. D. Mootz, and M. A. Marahiel. The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases. Chem- istry and Biology, 6(8):493–505, 1999. ISSN 10745521. doi: 10.1016/S1074- 5521(99)80082-9.

S. Suda and R. W. Curtis. Antibiotic properties of malformin. Applied microbiology, 14(3):475–476, 1966. ISSN 00036919.

V. Valiante, D. J. Mattern, A. Sch¨uffler,F. Horn, G. Walther, K. Scherlach, L. Petzke, J. Dickhaut, R. Guthke, C. Hertweck, M. Nett, E. Thines, and A. A. Brakhage. Discovery of an Extended Austinoid Biosynthetic Pathway in Aspergillus calidoustus. ACS Chemical Biology, 12(5):1227–1234, 2017. ISSN 15548937. doi: 10.1021/acschembio.7b00003.

H. van Liempt, H. von D¨ohren, and H. Kleinkauf. delta-(L-alpha- aminoadipyl)-L-cysteinyl-D-valine synthetase from Aspergillus nidulans. The first enzyme in penicillin biosynthesis is a multifunctional peptide synthetase. The Journal of biological chemistry, 264(7):3680–4, mar 1989. ISSN 0021-9258. URL http://www.ncbi.nlm.nih.gov/pubmed/2645274.

J. Wang, Z. Jiang, W. Lam, E. A. Gullen, Z. Yu, Y. Wei, L. Wang, C. Zeiss, A. Beck, E. C. Cheng, C. Wu, Y. C. Cheng, and Y. Zhang. Study of malformin C, a fungal source cyclic pentapeptide, as an anti- cancer drug. PLoS ONE, 10(11):1–19, 2015a. ISSN 19326203. doi: 10.1371/journal.pone.0140069.

J. Wang, Z. Jiang, W. Lam, E. A. Gullen, Z. Yu, Y. Wei, L. Wang, C. Zeiss, A. Beck, E. C. Cheng, C. Wu, Y. C. Cheng, and Y. Zhang. Study of malformin C, a fungal source cyclic pentapeptide, as an anti- cancer drug. PLoS ONE, 10(11):1–19, 2015b. ISSN 19326203. doi: 10.1371/journal.pone.0140069.

K. J. Weissman. The structural biology of biosynthetic megaenzymes. Nature Chemical Biology, 11(9):660–670, 2015. ISSN 1552-4450. doi: 10.1038/ nchembio.1883. URL http://dx.doi.org/10.1038/nchembio.1883.

A. J. Wright. The Penicillins. Mayo Clinic Proceedings, 74(3): 290–307, mar 1999. ISSN 00256196. doi: 10.4065/74.3.290. URL http://www.ncbi.nlm.nih.gov/pubmed/10090000http: //linkinghub.elsevier.com/retrieve/pii/S0025619611638676.

G. Yang. A Polyketide Synthase Is Required for Fungal Virulence and Production of the Polyketide T-Toxin. the Plant Cell Online, 8(11):

61 2139–2150, 1996. ISSN 10404651. doi: 10.1105/tpc.8.11.2139. URL http://www.plantcell.org/cgi/doi/10.1105/tpc.8.11.2139.

C.-S. Yun, T. Motoyama, and H. Osada. Biosynthesis of the mycotoxin tenuazonic acid by a fungal NRPS-PKS hybrid enzyme. Nature com- munications, 6:8758, 2015. ISSN 2041-1723. doi: 10.1038/ncomms9758. URL http://www.nature.com/ncomms/2015/151027/ncomms9758/full/ ncomms9758.html.

X. Zhou, X.-X. Shen, C. T. Hittinger, and A. Rokas. Evaluating Fast Max- imum Likelihood-Based Phylogenetic Programs Using Empirical Phyloge- nomic Data Sets. Molecular Biology and Evolution, 2017. ISSN 0737- 4038. doi: 10.1093/molbev/msx302. URL http://academic.oup.com/ mbe/advance-article/doi/10.1093/molbev/msx302/4644721.

62 Chapter 2

The genomes of Aspergillus section Nigri reveal drivers in fungal speciation

In our effort to genome sequence the whole genus Aspergillus we started with section Nigri. Characterizing the genome dynamics on the section, clade and isolate level will support the development of pipelines which can describe a genus. Aspergillus section Nigri contains many species relevant to biotech- nology and biomedicine — making it an ideal use case. In this study we devel- oped new tools to describe these fungi using comparative genomics. Core-pan genome analysis and generating homologous groups of secondary metabolite gene clusters show the high genetic versatility of the section. Furthermore, we examined CAZymes to investigate the carbohydrate degrading potential of this species. Our results lets us characterize genetic diveristy over related species and establish a model of three stages of fungal speciation: acquisi- tion, diversification and consolidation. In summary, we established pipeline will facilitate characterization of new de novo sequenced fungi.

63

1 The genomes of Aspergillus section Nigri

2 reveal drivers in fungal speciation 3 4 Tammi C. Vesth (1), Jane L. Nybo (1), Sebastian Theobald (1), Jens C. Frisvad (1), Thomas O. 5 Larsen (1), Kristian F. Nielsen (1), Jakob B. Hoof (1), Julian Brandl (1), Asaf Salamov (3), Robert 6 Riley (3), Morten T. Nielsen (1), Ellen K. Lyhne (1), Martin E. Kogle (1), Kimchi Strasser (9), Erin 7 McDonnell (9), Kerrie Barry (3), Alicia Clum (3), Cindy Chen (3), Matt Nolan (3), Laura Sandor (3), 8 Alan Kuo (3), Anna Lipzen (3), Matthieu Hainaut (4,5), Elodie Drula (4,5), Adrian Tsang (9), Bernard 9 Henrissat (4,5,6), Ad Wiebenga (7), Miia R. Mäkelä (7,8), Ronald P. de Vries (7), Igor V. Grigoriev 10 (3), Uffe H. Mortensen (1), Scott E. Baker (2)*, Mikael R. Andersen (1)*. 11 12 1) Department of Biotechnology and Bioengineering, Technical University of Denmark, Kgs. Lyngby, 13 Denmark 14 2) US Department of Energy Joint Bioenergy Institute, Berkeley, CA, USA 15 3) US Department of Energy Joint Genome Institute, Walnut Creek, CA, USA 16 4) Architecture et Fonction des Macromolécules Biologiques, (CNRS UMR 7257, Aix-Marseille 17 University, 13288 Marseille, France. 18 5) INRA, USC 1408 AFMB, 13288 Marseille, France. 19 6) Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia 20 7) Fungal Physiology, Westerdijk Fungal Biodiversity Institute & Fungal Molecular Physiology, 21 Utrecht University, Utrecht, The Netherlands 22 8) Department of Food and Environmental Sciences, Division of Microbiology and Biotechnology, 23 University of Helsinki, Finland 24 9) Centre for Structural and Functional Genomics, Concordia University, Montreal, Quebec, Canada 25 *Corresponding authors

26 Abstract

27 Aspergillus section Nigri comprises filamentous fungi relevant to biomedicine, bioenergy, health, and 28 biotechnology. In order to learn more about fungal speciation, as well as potential for applications in 29 biotechnology and biomedicine, we sequenced 23 genomes de novo, forming a full genome 30 compendium for the section (26 species), as well as six A. niger isolates. Comparative genomics of 31 38 fungal genomes allowed us to assess inter- and intra-species genomic variation. We predicted 32 17,903 CAZymes and 2,717 secondary metabolite gene clusters, which we condensed into 455 33 distinct families corresponding to compound classes, 49% of which are only found in single species. 34 We performed metabolomics and genetic engineering to correlate genotypes to phenotypes, as 35 demonstrated for the metabolite aurasperone, and by heterologous transfer of citrate production to 36 A. nidulans. All analyses supported a role in speciation for secondary metabolism and regulators 37 and allowed us to propose a three-step model for fungal speciation.

38 Keywords

39 Genomics; Aspergillus; Nigri; Primary Metabolism; Secondary Metabolism; CAZyme

40 Introduction

41 Species in genus Aspergillus are of broad interest to medical1, applied2,3, and basic research4. 42 Members of Aspergillus section Nigri (“black Aspergilli”) are prolific producers of native and 43 heterologous proteins5,6, organic acids (in particular64 citric acid7–9), and secondary metabolites

1

44 (including biopharmaceuticals and mycotoxins like ochratoxin A). Furthermore, the section members 45 are generally very efficient producers of extracellular enzymes10,11, they are the production 46 organisms for 49 out of 260 industrial enzymes12,13. Among the most important of these, in addition 47 to A. niger, are A. tubingensis, A. aculeatus, and A. luchuensis (previously A. acidus, A. kawachii, 48 and A. awamori14–16). 49 50 Members of Aspergillus section Nigri are also known as destructive degraders of foods and feeds, 51 and some isolates produce the potent mycotoxins ochratoxin A17 and fumonisins18–20. In addition, 52 some species in this section have been proposed to be pathogenic to humans and other animals21. 53 It is thus of interest to further examine section Nigri for industrial exploitation, prevention of food 54 spoilage, toxin production, and pathogenicity caused by these fungi. 55 56 A combined phylogenetic and phenotypic approach has revealed that section Nigri contains at least 57 27 species22–26. Recent results have shown that the section contains species with a high diversity 58 and may consist of two separate clades: the biseriate species and the uniseriate species27, which 59 show differences in sexual states28, sclerotium formation29, and secondary metabolite production30. 60 In the section, only six species have had their genome sequenced: A. niger31,32, A. luchuensis16,33, 61 A. carbonarius34, A. aculeatus34, A. tubingensis34, and A. brasiliensis34. 62 63 This section, with its combination of fungal species with diverse impact on humanity and species- 64 richness, is thus an interesting target for studying how fungi diversify into species. In this study, we 65 have de novo sequenced the genomes of 20 species of section Nigri, thus completing a genome 66 compendium of 26 described species in the section. Further, we have genome-sequenced three 67 additional A. niger isolates (including two previously described as species A. lacticoffeatus11 and A. 68 phoenicis35), which in combination allows for inter- and intra-species comparison of 32 isolates. The 69 development of methods for comparative genomics, combined with experimental analysis of the 70 species, allows us to track genetic diversity across genomes, from the protein level, over the 71 evolution of biosynthetic gene clusters, to the groups of genes which define clades or individual 72 species. This analysis in conjunction with the high resolution in genome sequences allows us to 73 propose a hypothesis on fungal species evolution and diversification.

74 Results and Discussion

75 Analyzing 23 new genomes reveals high genetic diversity of Aspergillus section Nigri 76 We present 23 whole genome sequences: 20 genomes of section Nigri species previously 77 unsequenced and three additional A. niger genomes for assessment of intraspecies diversity. All 78 genomes were sequenced, assembled, and annotated using the JGI fungal genome pipeline36,37 (SI 79 1). Figure 1 shows a phylogenetic tree as well as gene richness, number of scaffolds, and functional 80 annotation (InterPro38,39). The tree supports previous proposals11,35 that A. lacticoffeatus and A. 81 phoenicis are indeed synonyms of A. niger. 82 83 In comparing key statistics of the genomes, some traits are quite similar, and others surprisingly 84 variable. Many of the investigated species have around the average number of genes (11,900), but 85 there is considerable variation from the smallest number of predicted genes (10,066) to the highest 86 (13,687). The smallest number of predicted genes in section Nigri is found in A. saccharolyticus 87 which supports the previous observation40,41 that this species is quite atypical in section Nigri. 88 89 65

2

90 91 Figure 1. Dendrogram and bubble plots illustrating phylogenetic distances between 32 genomes from section 92 Nigri as well as four non-Nigri Aspergillus species, a Penicillium and a Neurospora genome (for outgroups). 93 Additional information is available in SI 1. A) Phylogenetic tree created using RAxML51, MAFFT52, and 94 Gblocks53, based on 2,022 conserved genes. Plate growth pictures are presented for each newly sequenced 95 species. B) Colors indicate whether the organism is from this or another sequencing project. C) Five bubble 96 plots of descriptive numbers for each genome. The bubble sizes have been scaled to the categories and are 97 not comparable across categories. 98 99 We further evaluated the annotation of the 23 genome sequences we generated. The percentage of 100 complete genes (including start and stop codon) is in the range of 94-98%, and 67% of the proteins 101 could be assigned one or more InterPro domains. The number of scaffolds (average 166) vary from 102 47 in A. piperis to 518 in A. ellipticus. On average, 70% of the proteins had sequence homologs in 103 Swissprot (91% of proteins have homologs within section Nigri, see next section). This means that 104 even though six members of section Nigri have already been sequenced, ~30% of each of the new 105 genomes are not found in Swissprot, which demonstrates the large genetic diversity in the section. 106 107 Constructing the pan- and core-genome of section Nigri shows genome flexibility and 108 many species-unique genes 109 110 Given the genetic diversity in section Nigri, we were interested in examining the extent and timing of 111 the genetic diversification events. For this analysis, we focused on three conceptual groups of genes: 112 1) The pan-genome: all genes present in one or more species. 2) The core-genome: genes present 113 in all included species including paralogs. This set is expected to encode cellular functions needed 66

3

114 for all species. 3) Species-unique genes: found in only one species with or without paralogs. These 115 genes are expected to be involved in environmental adaptation and speciation. 116 117 We first identified orthologs and paralogs using a BLASTP-based pipeline with reciprocal cut-offs 118 specific to the Aspergillus genus (SI 2). Groups of homologous proteins are referred to as families. 119 Figure 2A–C shows the overall genetic diversity between 38 fungal species from closely related 120 genera, 36 species within the Aspergillus genus and 32 species from section Nigri.

121 122 Figure 2. Genetic diversity between fungal species from closely related genera, species within the Aspergillus 123 genus and section Nigri. (A-C) Histogram representation of the total number of genes and families distributed 124 over the pan (green), core (blue), and unique (orange) genome for (A) the 38 fungal genomes of this study, 125 (B) the 36 Aspergillus genus genomes and, (C) the 32 genomes in section Nigri. (D) Dendrogram of the 126 phylogenetic relation between the 32 species in section Nigri. The black nodes represent homolog-families 127 found only in the species branching from the nodes. The white boxes represent the genes unique to the specific 128 species. (E) Stacked histograms of the gene count, the core-genome as well as unique genes for each species, 129 numbers show total number of genes. (full colours: InterPro annotation, light colours: no annotation). 130 131 The Aspergillus genus pan-genome comprises 433,116 genes across the 36 Aspergillus genomes 132 and from this, 62,996 gene families were constructed. 6% of those are found in all genomes (3,769 133 core families) while 9% are genes without orthologs in the other genomes (40,424 unique genes, 134 39,929 unique families) (Figure 2B). We also found evidence of potential gene transfers between 135 species of this section, as 23% of the pan-gene families are not present in groups of species fitting 136 the phylogenetic tree, hence indicating transfer of genes from one branch to another. 137 138 We further performed an analysis defining the number67 of core-gene families in section Nigri and in 139 all sub-clades hereof (Figure 2D). The core-genome of section Nigri was found to be 32% larger than

4

140 that of the genus (4,983 families relative to 3,769, Figure 2B–C). Conversely, 9% are unique to a 141 specific species (32,378 unique genes in 32,036 families, Figure 2C). The fraction of genes unique 142 to a species is similar within the section and across the genus, meaning that adding a new section 143 Nigri genome adds as many new genes as adding a more distantly related Aspergillus. This is rather 144 interesting and shows a generally high genetic diversity of genus Aspergillus. Such a tendency could 145 be the result of over-predicting genes, considering the low rate of InterPro annotation in the unique 146 genes (Figure 2E); however, it could also be the result of horizontal gene transfer from 147 uncharacterized species. 148 149 The section Nigri core genome contains signature genes of the black Aspergilli 150 151 To associate biological functions to the pan-, core-, and unique genomes, and genes exclusive to 152 only members of the black Aspergilli, we employed the InterPro database39. Examining the core- 153 genome of 38 fungal genomes (Figure 2A), only 4.5% of the genes lack InterPro domains (SI 3, 4.1), 154 indicating – as would be expected – that the core-genes across closely related fungal genera include 155 generally known and conserved functions. For the pan-genome of the 36 Aspergillus species versus 156 the section Nigri pan genome, the percentage of unknown function is similar (32% vs 33%, SI 4.4, 157 6.1), as are the numbers for the core-genomes (14% vs 17%, Figure 2E, SI 4.4, 6.1). General 158 functions like transporters, regulators, organelle specific proteins, primary metabolism, and structural 159 domains were found as core-features across all 36 Aspergilli (SI 4.6), which supports the general 160 validity of the method. 161 162 We expected the section Nigri core-genome (gene families only found in all the species of section 163 Nigri and not in other Aspergilli examined) to contain Nigri signature genes, and found this to be the 164 case. These families contain 580 InterPro domains conserved to a varying degree, including a high 165 number of genes involved in the saprotropic lifestyle and secondary metabolism (SI 7). 166 167 Unique genes involved in secondary metabolism and regulation appear to be 168 common drivers of speciation 169 The genetic diversity seen in section Nigri led us to investigate whether the unique genes for each 170 species could provide clues to the cause of fungal speciation. While these genes by definition do not 171 have homologs in other species, we can predict general functions using InterPro domains. Unique 172 genes of species in section Nigri matched 1,334 different InterPro domains (SI 8.3). Within this, we 173 searched for common functions in all genes unique to section Nigri species (excluding six A. niger 174 isolates due to intraspecies variation). Surprisingly, we only identified ten domains which were found 175 in nearly all Nigri species (25–26 species). Notably, nine in ten are related to functions involved in 176 secondary metabolites or gene or protein regulation (SI 9). Finding these functions in nearly all sets 177 of species-specific genes suggests that secondary metabolite production and regulatory proteins are 178 drivers in fungal speciation. 179 180 Intra-species genetic variation is similar to inter-species variation 181 We were interested in comparing the diversity between isolates of the same species to that of the 182 diversity of species in the same clade. We thus compared six A. niger isolates to the eight closely 183 related species in the A. tubingensis clade (Figure 2D). The A. niger isolates have a high degree of 184 genetic homogeneity as 80% of the A. niger pan-genome is conserved across the six isolates and 185 only 6% is unique to any of the isolates (SI 10A). The same scale is seen in the A. tubingensis clade 186 (77% shared pan-genome, 7% unique, SI 10B). Moreover, the percentage of genes with predicted 187 functional domains within the two groups is similar to68 that of section Nigri as a whole (SI 11, 6.1, 188 12.1, 12.4). For the unique genes of the two groups, these are largely of unknown function (A.

5

189 tubingensis clade 82%, A niger complex 86%, SI 12.1, 12.4, 13.1, 13.2). The functions of the A. niger 190 core-genome (3,798 domains) are, not surprisingly, very similar to that of section Nigri as a whole 191 (SI 6.3, 12.3). In summary, the inter-species variation in the A. tubingensis clade is of the same scale 192 as the intra-species variation in the A. niger isolates, showing that a large genetic variation does not 193 directly translate to the currently circumscribed species. 194 195 Acid-producing species have extra citrate synthase genes which confer increased 196 citrate production in A. nidulans 197 198 As species of section Nigri are known organic acid-producers, the genes involved in central 199 metabolism are of interest; as the cause of citric acid overproduction in several of the section 200 members is still not identified42. We thus analysed the number of paralogs in central carbon 201 metabolism in our set of 38 fungal genomes using a curated version of an A. niger genome-scale 202 metabolic model43 as a source of pathway annotation (Figure 3). 203

204 205 Figure 3. Gene family sizes for 31 genomes from section Nigri and six reference species. Proteins putatively 206 catalyzing the same metabolic step (i.e. complex subunits or functionally related enzymes) have been grouped 207 in grey boxes. Species are sorted according to the phylogeny of Figure 1. A) Glycolysis isoenzymes. B) TCA 208 cycle isoenzymes. C) Growth of strains on CREA medium, which turns yellow with lowered pH. Note that A. 209 niger CBS 513.88 has been mutagenized31. 210 211 The analysis of paralogs in glycolysis shows very little variance across the 32 Nigri genomes but 212 varies compared to the six other fungal species (Figure 3A). For the TCA cycle (Figure 3B), it is 213 evident that certain metabolic steps in the pathway69 are conserved throughout all species, while

6

214 others vary in paralog numbers. The biseriates are particularly homogeneous. These, along with four 215 uniseriates, are also the primary citric acid-producing species in the section (Figure 3C). 216 217 Of particular interest is the citrate production phenotype, and thus citrate synthase. All biseriates 218 have one extra citrate synthase, the four acid-producing uniseriates have two extra. Sequence 219 alignment revealed three distinct types (SI 14), two of which are mitochondrial and found in all 220 species. All extra citrate synthase paralogs are of the third type, predicted to be cytosolic. We 221 identified the extra biseriate citrate synthase (citB44) in a conserved 30 kB gene cluster including two 222 transcription factors, a transporter and two putative fatty acid synthases. We performed heterologous 223 expression of the A. niger citB gene cluster in A. nidulans (which only has the two mitochondrial 224 citrate synthases) using two constitutive promoters to control the transcription factors. This 225 expression increased citrate concentrations by 42-52% (SI 15). We hypothesize that this gene 226 cluster may have a particular role in citrate production and additional undescribed functions involving 227 the fatty acid synthase-like genes. 228 229 Species-specific carbon utilization is not correlated with CAZyme content, but may 230 be determined by CAZyme regulation 231 Aspergilli have a particularly broad ability to degrade and convert plant biomass34. It is thus essential 232 to examine the species diversity of this trait. We predicted the CAZy gene content of the genomes 233 across section Nigri (17,903 CAZy domains, Figure 4, SI 16) and performed growth profiling on plant 234 biomass-related carbon sources (SI 17). Growth on D-glucose was used to evaluate relative growth, 235 showing variation between species.

236 237 Fig. 4. Comparison of CAZy gene content divided by target polysaccharide. Details on CAZy families are 238 available in SI 16. Growth profiles are available in SI 17. 70

7

239 All black Aspergilli grew well on pectin and have a highly conserved and extensive set of genes 240 encoding pectin-active enzymes. Growth on other plant polysaccharides such as xylan, starch, and 241 guar gum is more variable, despite highly conserved genes related to xyloglucan and starch 242 degradation. The growth and genetic variability on inulin is particularly high, nine species showed 243 reduced growth. Moreover, endoinulinase (GH32 INU) is only present in eight of the black Aspergilli 244 while the remainder of inulin-related genes (GH32 INV and INX) are more commonly present (SI 245 16). However, the growth phenotypes show no correlation with the gene content (SI 17). 246 247 In a previous study11, enzyme levels were measured in several black Aspergilli and significant 248 differences were found. However, these differences do not reflect the genetic differences seen here 249 (SI 16). Considering the relative uniformity of the CAZyme content (Figure 4), no correlation between 250 genome content and growth on plant biomass-related carbon sources (SI 17) was observed for the 251 black Aspergilli, suggesting that the differences in the capability of plant biomass degradation reflect 252 gene expression levels in the individual fungus. This confirms a proteome study of less-related 253 Aspergilli where the different response to plant biomass appeared to be mainly at the regulatory 254 level45. The data suggest that this is the case for section Nigri: Species-specific phenotypes is not 255 generally driven by CAZyme content in closely related species, but by regulation. 256 257 Section-level analysis of secondary metabolism identifies 2,717 gene clusters in 455 258 distinct families 259 260 Secondary metabolism (SM) is thought to be the basis of chemical defence, virulence, toxicity, 261 mineral uptake and communication in fungi46 and have a wide range of potential medical 262 applications. As it was also suggested above to be involved in speciation, we examined the 263 exometabolite diversity of 37 Aspergillus and Penicillium species based on predictions of SM gene 264 clusters (SMGCs) as well as chemical profiles of the species of section Nigri on multiple substrates. 265 266 We identified 2,717 SMGCs in the 37 genomes. This is an even higher number of SMGCs per 267 species than a previous study found in 24 Penicillium genomes47. We were further interested in 268 quantifying the actual diversity of the SMGCs in section Nigri and to analyse presence patterns of 269 SMGCs across species. We therefore defined SMGC “families” as genetically similar SMGCs across 270 genomes (SI 2). Each SMGC family is expected to produce the same or similar compounds. This 271 clustering resulted in the definition of 455 SMGC families across the 37 genomes (SI 19), indicating 272 the potential production of 455 different chemical families. Most (82%) are found in less than 10 273 organisms, and 49% contain only one cluster (SI 20 shows examples). This is on average 8.75 274 unique clusters per species, despite the close phylogenetic distance of the section. 275 276 Phylogenetic examination of secondary metabolite families shows dynamics and 277 supports involvement in speciation 278 279 To understand more about how SMGCs are exchanged between species, each of the 455 SMGC 280 families was characterized by the type of backbone enzyme and analysed according to the 281 phylogeny (Figure 5A–B). Only five out of all SMGCs were present in all analysed species including 282 clusters for the NRPS-derived siderophore ferrichrome, the circular NRP fungisporin48/nidulanin A49 283 and pigment (YWA1) synthesis. Two shared SMGC families were false predictions, namely two fatty 284 acid synthases. 285 286 71 287

8

288 289 Figure 5. Shared secondary metabolism gene clusters (SMGC) and types of cluster backbones for section 290 Nigri. A) Cladogram of 36 Aspergillus (32 Nigri) genomes and one Penicillium genome. Numbers at nodes 291 show SMGC families shared by species in the branch. End points show families unique to an organism. B) 292 Total SMGC family profile of the organism classified by backbone enzyme type. C) Presence/absence map of 293 detected secondary metabolites in section Nigri species (SI 2). Strains with no available metabolite data are 294 marked in light gray. DMAT: Dimethylallyltransferase (Prenyl transferases), HYBRID: Gene containing 295 domains from NRPS and PKS backbones, NRPS: Non-ribosomal peptide synthetase, NRPS-like: Non- 296 ribosomal peptide synthetase like containing at least two NRPS specific domains and another domain or one 297 NRPS A domain in combination with NAD_binding_4 domain or short chain dehydrogenase., PKS: Polyketide 298 synthase, PKS-like: Polyketide synthase like, containing at least two PKS specific domains and another 299 domain, TC: Terpene cyclase. 300 301 Examining the dynamics of the families, only 32% and 19% of SMGCs found in two or three 302 organisms respectively follow the whole-genome phylogeny and suggest intra-genus horizontal gene 303 transfer (HGT) or SMGC loss to be relatively common. As an example, an SMGC is found in five 304 distantly related species (SI 20B). 305 306 As seen in Figure 5A, the presence of unique SMGC families at every major branch point in the 307 phylogenetic tree supports that SMGC evolution is a driver in speciation: All biseriates share a 308 previously undescribed PKS and an NRPS-like, the A. carbonarius clade a terpene cyclase, and the 309 remaining biseriates share another PKS and an NRPS-like. Furthermore, the A. niger complex and 310 A. tubingensis clade each share unique PKS genes. Uniseriates share four unique previously 311 undescribed SMGC families (Figure 5A). Examining the individual species, every single section Nigri 312 species with the exception of A. tubingensis has a unique set of SMGCs. Conversely, all six A. niger 313 isolates all have the same set of SMGCs. These patterns show the high diversity of secondary 314 metabolism as well as support the role of these genes in speciation. 315 316 72

9

317 Correlation of the exometabolome with SMGC families pinpoints SMGC candidates 318 As a further application of the constructed SMGC families, we hypothesized that we can correlate 319 SMGC families to classes of compounds. We performed extensive exometabolome analysis of 27 320 of the sequenced strains, and identified 35 compound families (Figure 5C, SI 21). 321 322 The most abundant group was naphtha-gamma-pyrones, of which aurasperone B30 was identified in 323 14 of the isolates. We compared the presence patterns of SMGC families with the compound class 324 and combined it with a knowledge-based filtering of InterPro domains leaving one hit (SI 20D). The 325 candidate SMGC family is a nine-gene cluster found in 18 genomes – including the 14 where we 326 detect the compound – and it contains all activities needed to synthesize aurasperone. As a support 327 of this identification, a SMGC for a closely related compound, aurofusarin, has been experimentally 328 verified in Fusarium graminearium50. The aurasperone cluster shares six genes (whereof one is a 329 duplication) with the aurofusarin cluster. This supports the assignment of this family of SMGCs to 330 the production of aurasperone B, and conceptually supports this approach for efficient linking of 331 clusters to compounds. We see this correlation approach as highly useful for future elucidations of 332 fungal metabolites. 333 334 A proposed model for fungal speciation 335 Based on our analysis above, we note that both secondary metabolites and regulatory genes would 336 have a phenotype in the host right after acquisition, making them more likely to drive diversification 337 and ultimately speciation. We speculate that this happens in three steps: 1) Acquisition: One or more 338 genes provide superior fitness or a competitive advantage. 2) Diversification: A growth advantage 339 allows the genome to diverge without strong selective pressure. The diverse A. niger genomes could 340 be seen as being in this stage (Figures 1, 2, and 5). 3) Consolidation: Traits are fixed in the 341 population, and a species with a new lifestyle can proliferate (as seen in the nodes of Figure 2D and 342 the SMGC analysis below). 343 344 Conclusion 345 We have sequenced the genomes of a whole section of filamentous fungi, and a diverse set of A. 346 niger isolates, and found that the species are highly diverse on some traits, in particular secondary 347 metabolism, and homogeneous on others, such as glycolytic metabolism, and CAZymes. The 348 presented data furthermore provides an extensive compendium of 24 new genomes which provides 349 substantial information on fungal genetic diversity. We further combined genome analysis with 350 metabolite profiling and genetic engineering to identify the genetic basis of several phenotypes within 351 primary and secondary metabolism. 352 Of particular interest was the finding that the species-specific genes in all species share functions 353 within gene/protein regulation and secondary metabolism. This suggests that these highly 354 transferable traits are particularly active and efficient drivers of fungal speciation.

355 Acknowledgements

356 TCV, JLN, ST, and MRA acknowledge funding from The Villum Foundation, grant VKR023437. JB 357 and MRA acknowledge funding from the Novo Nordisk Foundation, grant NNF13OC0004831. MRA 358 and TCV acknowledge funding from the Danish National Research Foundation (DNRF137) for the 359 Center for Microbial Secondary Metabolites. JCF acknowledge funding from the Novo Nordisk 360 Foundation, grant NNF13OC0005201. The work conducted by the U.S. Department of Energy Joint 361 Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of 362 the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Inge Kjaerboelling is 363 acknowledged for critical and constructive feedback73 on the manuscript and figures.

10

364 Supplementary Information

365 SI 1 366 Genome statistics of sequencing and annotation. Absence of information is indicated by a dash. 367 Details for column contents are as follows: nrHits: proteins which had a significant hit in Swissport 368 or NCBI non-redundant database. completeCDS : gene predictions containing both start and stop 369 codon. 75percentCovByDNA: genes with >= 75% coverage of exons by RNA. 370 totalExonsCovByRNA: genes with 100% coverage of exons by RNA. totalGenes: total number of 371 predicted genes. repeatRegions: number of Repeat-covered regions. length_repeatRegions: total 372 length of repeat-covered regions. seqType: sequencing setup. assemblyType: assembly pipeline. 373 study: genome sequenced in this or other studies.

374 SI 2 375 Materials, Methods, and Protocols

376 SI 3

377 Dendrogram and InterPro coverage of the 36 Aspergillus genus species A. Dendrogram of the 378 phylogenetic relation between the 36 species. The black boxes represent the homologous genes 379 among the species branching from the nodes. The white boxes represent the genes unique to the 380 specific species. B. Genome sizes of each species. The black coloured boxes represent the genes 381 annotated by InterPro. C. Core-genome size of each species. The blue coloured boxes represent 382 the genes annotated by InterPro. D. Species unique genes. The yellow coloured boxes represent 383 the genes annotated by InterPro. The grey coloured boxes represent genes with no annotation. The 384 numbers at right side of the boxes indicates the total number of annotated and not annotated genes. 385 The bar scales are unique to each graph.

386 SI 4 387 Sheet 1: InterPro coverage of the 38 fungal species. The genome size and the InterPro coverage of 388 each of the 38 fungal species alongside the core and species unique segments of the genome. The 389 gene number representing the core and unique portion of the genomes will adjust relative to the 390 accompanying species. Sheet 2: InterPro domains and functions included in the pan-genome of the 391 38 fungal species. Sheet 3: InterPro domains in the core-genome of the 38 fungal species. 392 Sheet 4: InterPro coverage of the 36 Aspergillus genus species. The genome size and the InterPro 393 coverage of each of the 36 Aspergillus genus alongside the core and species unique segments of 394 the genome. The gene number representing the core and unique portion of the genomes will adjust 395 relative to the accompanying species. Sheet 5: InterPro domains and functions included in the pan- 396 genome of the 36 Aspergillus genus species. Sheet 6: InterPro domains in the core-genome of the 397 36 Aspergillus genus species.

398 SI 5

399 Phylogenetic relation and InterPro coverage of the fungal species included in this study. A. 400 Dendrogram of the phylogenetic relation between the 38 species. The black boxes represent the 401 homologous genes among the species branching from the nodes. The white boxes represent the 402 genes unique to the specific species. B. Genome sizes of each species. The black coloured boxes 403 represent the genes annotated by InterPro. C. Core-genome size of each species. The blue coloured 404 boxes represent the genes annotated by InterPro. D Species unique genes. The yellow coloured 405 boxes represent the genes annotated by InterPro. The74 grey coloured boxes represent genes with no

11

406 annotation. The numbers at right side of the boxes indicates the total number of annotated and not 407 annotated genes. The bar scales are unique to each graph.

408 SI 6 409 Sheet 1: InterPro coverage of section Nigri species. The genome size and the InterPro coverage of 410 each of section Nigri species alongside the core and species unique segments of the genome. The 411 gene number representing the core and unique portion of the genomes will adjust relative to the 412 accompanying species. Sheet 2: InterPro domains and functions included in the pan-genome of 413 section Nigri species. Sheet 3: InterPro domains in the core-genome of the section Nigri species.

414 SI 7 415 The InterPro domains for the core-genome specific to section Nigri. The InterPro domains not found 416 in the core-genome of the non-Nigri Aspergillus species.

417 SI 8 418 InterPro domains in the unique gene set. Sheet 1: InterPro domains in the unique gene set of the 38 419 fungal species in this study. Sheet 2: InterPro domains in the unique gene set of the Aspergillus 420 genus. Sheet 3: InterPro domains in the unique gene set of section Nigri.

421 SI 9 422 InterPro domains potentially involved in general speciation within section Nigri without A. niger 423 isolates. InterPro domains in the unique gene set of section Nigri without the A. niger isolates where 424 the InterPro domain is found in 25 or 26 species out of 26 species.

425 SI 10 426 A. Pan, core, unique genes and families of the Nigri clade species 427 The total number of the proteins and families of all genes in the Tubingensis clade (pan-genome, 428 green), genes shared by all species (core-genome, blue), and genes unique to the individual species 429 (species unique genes, orange). The families were predicted using BLASTp alignments with cutoffs 430 specific to the Aspergillus genus and single linkage clustering designed for this project. Nigri clade 431 species: A. lacticoffeatus, A. niger ATCC 1015, A. niger ATCC 13496, A. niger CBS 513.88, A. niger 432 NRRL3, and A. phoenicis. 433 B. Pan, core, unique genes and families of the Tubingensis clade species 434 The total number of the proteins and families of all genes in the Tubingensis clade (pan-genome, 435 green), genes shared by all species (core-genome, blue), and genes unique to the individual species 436 (species unique genes, orange). The families were predicted using BLASTp alignments with cutoffs 437 specific to the Aspergillus genus and single linkage clustering designed for this project. Tubingensis 438 clade species: A. costaricaensis, A. eucalypticola, A. luchuensis CBS 106.47, A. luchuensis IFO 439 4308, A. neoniger, A. piperis, A. tubingensis and A. vadensis.

440 SI 11 441 Phylogenetic relation and InterPro coverage of the A. tubingensis clade species. A. Dendrogram of 442 the phylogenetic relation between the 8 species. The black boxes represent the homologous genes 443 among the species branching from the nodes. The white boxes represent the genes unique to the 444 specific species. B. Genome sizes of each species. The black coloured boxes represent the genes 445 annotated by InterPro. C. Core-genome size of each species. The blue coloured boxes represent 446 the genes annotated by InterPro. D Species unique genes. The yellow coloured boxes represent the 447 genes annotated by InterPro. The grey coloured boxes75 represent genes with no annotation. The

12

448 numbers at right side of the boxes indicates the total number of annotated and not annotated genes. 449 The bar scales are unique to each graph. 450 Dendrogram and InterPro coverage of the A. niger clade species 451 Phylogenetic relation and InterPro coverage of the A. niger clade species. A. Dendrogram of the 452 phylogenetic relation between the 6 species. The black boxes represent the homologous genes 453 among the species branching from the nodes. The white boxes represent the genes unique to the 454 specific species. B. Genome sizes of each species. The black coloured boxes represent the genes 455 annotated by InterPro. C. Core-genome size of each species. The blue coloured boxes represent 456 the genes annotated by InterPro. D Species unique genes. The yellow coloured boxes represent the 457 genes annotated by InterPro. The grey coloured boxes represent genes with no annotation. The 458 numbers at right side of the boxes indicates the total number of annotated and not annotated genes. 459 The bar scales are unique to each graph.

460 SI 12 461 Sheet 1: InterPro coverage of the Niger clade species. The genome size and the InterPro coverage 462 of each of the Niger clade species alongside the core and species unique segments of the genome. 463 The gene number representing the core and unique portion of the genomes will adjust relative to the 464 accompanying species. Sheet 2: InterPro domains and functions included in the pan-genome of the 465 Niger clade species. Sheet 3: InterPro domains in the core-genome of the Niger clade species. 466 Sheet 4: InterPro coverage of the Tubingensis clade species. The genome size and the InterPro 467 coverage of each of the Tubingensis clade alongside the core and species unique segments of the 468 genome. The gene number representing the core and unique portion of the genomes will adjust 469 relative to the accompanying species. Sheet 5: InterPro domains and functions included in the pan- 470 genome of the Tubingensis clade species. Sheet 6: InterPro domains in the core-genome of the 471 Tubingensis clade species.

472 SI 13 473 InterPro domains in the unique gene set. Sheet 1: InterPro domains and functions in the Niger clade 474 unique gene set. Sheet 2: InterPro domains and functions in the Tubingensis clade unique gene set.

475 SI 14 476 Sequence comparison of citrate synthase genes. The sequence of the putative citrate synthases 477 identified in the copy number analysis has been compared using neighbor-joining as implemented 478 in MUSCLE. Potential subcellular locations have been predicted using the TargetP method.

479 SI 15 480 Overview of areas under curve for citrate production using HPLC quantification. Citrate was 481 measured in duplicate experiments for three strains cultivated with and without manganese added 482 to the growth medium.

483 SI 16

484 Table providing overviews of number of Glycoside Hydrolases (GH), Glycosyl Transferases (GT), 485 Polysaccharide Lyases (PL), Carbohydrate Esterases (CE), Carbohydrate Binding Modules (CBM) 486 and Auxiliary Activities (AA) in 37 genomes included in the study.

487 SI 17

488 Comparative growth profiling of the aspergilli from section76 Nigri and reference fungi.

13

489 SI 18

490 Schematic representation of secondary metabolic gene cluster family identification. Protein blast 491 comparisons between all gene cluster members are aggregated into one cluster similarity score. 492 From this, a network is created with gene clusters as nodes (dots) and similarity score as edges 493 (arrows). Subsequently, random walk clustering is used to find communities of nodes inside the 494 network. Green arrows indicate a high probability that nodes will be assigned to one community. Red 495 arrows indicate community borders. Resulting families are containing communities of related SMGC.

496 SI 19 497 Barplot describing SMGC family frequencies. The bars illustrate the presence of a SMGC family in 498 a certain number of organisms. Numbers above bars show the total counts.

499 SI 20 500 Overview of predicted SMGC families (B-D) and their location in the phylogeny. A: Cladogram of 501 species used for secondary metabolic gene cluster analysis in this study. Letter code indicates 502 predicted clusters for fumonisin, aurasperone B and an example gene cluster family predicted 503 exclusively in biseriates. B: Example gene cluster found in five distantly related species. C: Predicted 504 gene cluster family containing a PKS in Aspergillus niger CBS 513.8 similar to FUM 1 from Fusarium 505 oxysporum. D: Predicted gene cluster family for aurasperone B including the aurofusarin gene 506 cluster from Fusarium graminearum50

507 SI 21 508 Overview of SM families detected in 27 different Aspergillus isolates 509 510 511

512 References

513 1. Nierman, W. C. et al. Genomic sequence of the pathogenic and allergenic filamentous fungus 514 Aspergillus fumigatus. Nature 438, 1151–6 (2005). 515 2. Pel, H. J. et al. Genome sequencing and analysis of the versatile cell factory Aspergillus niger 516 CBS 513.88. Nat. Biotechnol. 25, 221–31 (2007). 517 3. Machida, M. et al. Genome sequencing and analysis of Aspergillus oryzae. Nature 438, 1157– 518 61 (2005). 519 4. Galagan, J. E. et al. Sequencing of Aspergillus nidulans and comparative analysis with A. 520 fumigatus and A. oryzae. Nature 438, 1105–1115 (2005). 521 5. Papagianni, M. Advances in citric acid fermentation by Aspergillus niger: biochemical aspects, 522 membrane transport and modeling. Biotechnol Adv 25, 244–263 (2007). 523 6. Punt, P. J. et al. Filamentous fungi as cell factories for heterologous protein production. Trends 524 in Biotechnology 20, 200–206 (2002). 525 7. Currie, J. The citric acid fermentation of Aspergillus niger. J. Biol. Chem. 31, 15–37 (1917). 526 8. Pel, H. J. et al. Genome sequencing and analysis of the versatile cell factory Aspergillus niger 527 CBS 513.88. Nat. Biotechnol. 25, (2007). 528 9. Andersen, M. R. et al. Comparative genomics of citric-acid-producing Aspergillus niger ATCC 529 1015 versus enzyme-producing CBS 513.88. Genome Res. 21, 885–97 (2011). 530 10. Wösten, H., Scoltmeijer, K. & de Vries, R. in Food Mycology: A multifaceted approach to fungi 531 and food (eds. Dijksterhuis, J. & Samson, R.) 183–196 (2007). 532 11. Meijer, M., Houbraken, J. A. M. P., Dalhuijsen, S., Samson, R. A. & de Vries, R. P. Growth 533 and hydrolase profiles can be used as characteristics to distinguish Aspergillus niger and 534 other black aspergilli. Stud. Mycol. 69, 19–30 (2011). 535 12. Association of Manufacturers and Formulators77 of Enzyme Products. List of commercial 536 enzymes. (2009). 14

537 13. Workman, M., Andersen, M. R. & Thykaer, J. Integrated Approaches for Assessment of 538 Cellular Performance in Industrially Relevant Filamentous Fungi. Ind. Biotechnol. 9, 337–344 539 (2013). 540 14. Hong, S.-B. et al. Aspergillus luchuensis, an industrially important black Aspergillus in East 541 Asia. PLoS One 8, e63769 (2013). 542 15. Perrone, G. et al. Aspergillus niger contains the cryptic phylogenetic species A. awamori. 543 Fungal Biol. 115, 1138–1150 (2011). 544 16. Futagami, T. et al. Genome Sequence of the White Koji Mold Aspergillus kawachii IFO 4308, 545 Used for Brewing the Japanese Distilled Spirit Shochu. Eukaryot. Cell 10, 1586–1587 (2011). 546 17. Abarca, M. L., Bragulat, M. R., Castella, G. & Cabanes, F. J. Ochratoxin A production by 547 strains of Aspergillus niger var. niger. Applied and Environmental Microbiology 60, 2650–2652 548 (1994). 549 18. Frisvad, J. C., Smedsgaard, J., Samson, R. A., Larsen, T. O. & Thrane, U. Fumonisin B2 550 production by Aspergillus niger. J. Agric. Food Chem. 55, 9727–32 (2007). 551 19. Frisvad, J. C. et al. Fumonisin and ochratoxin production in industrial Aspergillus niger strains. 552 PLoS One 6, (2011). 553 20. Perrone, G. et al. Biodiversity of Aspergillus species in some important agricultural products. 554 in Studies in Mycology 59, 53–66 (2007). 555 21. Monod, M. et al. Secreted proteases from pathogenic fungi. Int. J. Med. Microbiol. 292, 405– 556 19 (2002). 557 22. Jurjević, Z. et al. Two novel species of Aspergillus section Nigri from indoor air. IMA Fungus 558 3, 159–73 (2012). 559 23. Varga, J. et al. New and revisited species in Aspergillus section Nigri. Stud. Mycol. 69, 1–17 560 (2011). 561 24. Samson, R. A., Houbraken, J. A. M. P., Kuijpers, A. F. A., Frank, J. M. & Frisvad, J. C. New 562 ochratoxin A or sclerotium producing species in Aspergillus section Nigri. Stud. Mycol. 50, 45– 563 61 (2004). 564 25. Samson, R. A. et al. Diagnostic tools to identify black aspergilli. Stud. Mycol. 59, 129–45 565 (2007). 566 26. Samson, R. A. et al. Phylogeny, identification and nomenclature of the genus Aspergillus. 567 Stud. Mycol. 78, 141–73 (2014). 568 27. Visagie, C. M. et al. Aspergillus, Penicillium and Talaromyces isolated from house dust 569 samples collected around the world. Stud. Mycol. 78, 63–139 (2014). 570 28. Rajendran, C. & Muthappa, B. N. Saitoa, a new genus of Plectomycetes. Proc. Plant Sci. 89, 571 185–191 572 29. Frisvad, J. C. et al. Formation of Sclerotia and Production of Indoloterpenes by Aspergillus 573 niger and Other Species in Section Nigri. PLoS One 9, e94857 (2014). 574 30. Nielsen, K. F., Mogensen, J. M., Johansen, M., Larsen, T. O. & Frisvad, J. C. Review of 575 secondary metabolites and mycotoxins from the Aspergillus niger group. Analytical and 576 Bioanalytical Chemistry 395, 1225–1242 (2009). 577 31. Pel, H. J. et al. Genome sequencing and analysis of the versatile cell factory Aspergillus niger 578 {CBS} 513.88. Nat. Biotechnol. 25, (2007). 579 32. Andersen, M. R. et al. Comparative genomics of citric-acid-producing Aspergillus niger ATCC 580 1015 versus enzyme-producing CBS 513.88. Genome Res. 21, 885–897 (2011). 581 33. Yamada, O. et al. Genome sequence of Aspergillus luchuensis NBRC 4314. DNA Res. 582 (2016). doi:10.1093/dnares/dsw032 583 34. de Vries, R. P. et al. Comparative genomics reveals high biological diversity and specific 584 adaptations in the industrially and medically important fungal genus Aspergillus. Genome Biol. 585 18, 28 (2017). 586 35. Kozakiewicz, Z. et al. Proposals for nomina specifica conservanda and rijicienda in Aspergillus 587 and Penicillium (Fungi). Taxon 41, 109–113 (1992). 588 36. Grigoriev, I. V. et al. MycoCosm portal: Gearing up for 1000 fungal genomes. Nucleic Acids 589 Res. 42, 699–704 (2014). 590 37. Grigoriev, I. V., Martinez, D. A. & Salamov, A. A. Fungal genomic annotation. Appl. Mycol. 591 Biotechnol. 6, 123–142 (2006). 592 38. Hunter, S. et al. InterPro: the integrative protein78 signature database. Nucleic Acids Res. 37, 593 (2008).

15

594 39. Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 595 years. Nucleic Acids Res. 43, D213–D221 (2015). 596 40. Sørensen, A., Lübeck, P. S., Lübeck, M., Teller, P. J. & Ahring, B. K. β-glucosidases from a 597 new Aspergillus species can substitute commercial β-glucosidases for saccharification of 598 lignocellulosic biomass. Can. J. Microbiol. 57, 638–650 (2011). 599 41. Sørensen, A. et al. Identifying and characterizing the most significant β-glucosidase of the 600 novel species Aspergillus saccharolyticus. Can. J. Microbiol. 58, 1035–1046 (2012). 601 42. Karaffa, L. & Kubicek, C. P. Aspergillus niger citric acid accumulation: do we understand this 602 well working black box? Appl. Microbiol. Biotechnol. 61, 189–96 (2003). 603 43. Andersen, M. R., Nielsen, M. L. & Nielsen, J. Metabolic model integration of the bibliome, 604 genome, metabolome and reactome of Aspergillus niger. Mol. Syst. Biol. 4, 178 (2008). 605 44. Hossain, A. H. et al. Rewiring a secondary metabolite pathway towards itaconic acid 606 production in Aspergillus niger. Microb. Cell Fact. 15, 130 (2016). 607 45. Benoit, I. et al. Closely related fungi employ diverse enzymatic strategies to degrade plant 608 biomass. Biotechnol. Biofuels 8, 107 (2015). 609 46. Fox, E. M. & Howlett, B. J. Secondary metabolism: regulation and role in fungal biology. Curr. 610 Opin. Microbiol. 11, (2008). 611 47. Nielsen, J. C. et al. Global analysis of biosynthetic gene clusters reveals vast potential of 612 secondary metabolite production in Penicillium species. Nat. Microbiol. 2, 17044 (2017). 613 48. Ali, H. et al. A non-canonical NRPS is involved in the synthesis of fungisporin and related 614 hydrophobic cyclic tetrapeptides in Penicillium chrysogenum. PLoS One 9, (2014). 615 49. Andersen, M. R. et al. Accurate prediction of secondary metabolite gene clusters in 616 filamentous fungi. Proc. Natl. Acad. Sci. U. S. A. 110, E99-107 (2013). 617 50. Frandsen, R. J. N. et al. The biosynthetic pathway for aurofusarin in Fusarium graminearum 618 reveals a close link between the naphthoquinones and naphthopyrones. Mol. Microbiol. 61, 619 (2006). 620 51. Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large 621 phylogenies. Bioinformatics 30, 1312–1313 (2014). 622 52. Katoh, K. & Standley, D. M. MAFFT: multiple sequence alignment software version 7: 623 improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013). 624 53. Talavera, G. & Castresana, J. Improvement of phylogenies after removing divergent and 625 ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564–577 626 (2007). 627

79

16 80 Chapter 3

Uncovering bioactive compounds in Aspergillus section Nigri by genetic dereplication using secondary metabolite gene cluster networks

The vast amount of bioactive compounds produced by Aspergilli makes them a great resource of new pharmaceuticals. Secondary metabolite gene clusters (SMGCs) are the genetic basis for these compounds and are also an defining the ecology of fungi. In our effort to describe the SMGCs of the whole As- pergillus genus, we redefine genome mining by exploiting homologous genes through species. We create similarity networks of SMGCs to sort them into non-redundant families. Our analysis reveals the genetic basis for analogous SM pathways throughout Aspergillus section Nigri and determines shared SMGC families between species on isolate, clade, section and genus level, illustrating their dynamics and evolution. Another application is to char- acterize unknown SMGC through guilt-by-association with known examples inside their family — a process we termed genetic dereplication. Genetic dereplication facilitated genome mining and target prioritization in newly sequenced strains. The final extension of our approach elucidates the genes for unlinked SMs with valuable bioactivities. Here, we correlate the pres- ence of SMGC families through species and information on producing species to predict a non-ribosomal peptide synthetase necessary for production of the anticancer compound enhancing malformins (Wang et al., 2015b) in 18 strains. To illustrate the general applicability of our analyses, we decided to establish genetic engineering tools in A. brasiliensis; a species which has not previously been engineered nor used for elucidation of SMGC. We verified the predicted malformin synthetase gene mlfA by creating an A. brasiliensis mlfA deletion strain. Subsequent complementation of the knock-out strain with mlfA restored malformin production, demonstrating the accuracy of our pre- diction, and identifying the genetic basis of this pharmaceutically interesting compound.

81 Uncovering bioactive compounds in Aspergillus section Nigri by genetic dereplication using secondary metabolite gene cluster networks

Sebastian Theobalda, Tammi Vestha, Jakob Kræmmer Rendsviga, Kristian Fog Nielsena, Robert Rileyb, Lucas Magalhães de Abreuc, Asaf Salamovb, Jens Christian Frisvada, Thomas Ostenfeld Larsena, Mikael Rørdam Andersena,1, and Jakob Blæsbjerg Hoofa,1 aDepartment of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark; bDepartment of Energy Joint Genome Institute, Walnut Creek, CA, USA; cDepartment of Plant Pathology, Federal University of Viçosa, Viçosa, Brasil

This manuscript was compiled on February 1, 2018

The increased interest in secondary metabolites (SMs) from fungi for peptide bond formation, and epimerisation domains (E) has driven the number of genome sequencing projects to eluci- to change the chirality of their proximate amino acid. Most date their biosynthetic pathways. As a result, studies revealed that NRPSs investigated show a colinearity rule, meaning they are the number of secondary metabolite gene clusters (SMGCs) greatly assembled as modules in the order ATC. Euascomycete specific outnumbers detected compounds, challenging current methods to groups of NRPSs show substantial gain and loss of domains, dereplicate and categorize this amount of gene clusters on a larger further emphasizing the role of this enzyme class in chemi- scale. Here, we present an automated workflow for the genetic cal evolution of fungi (11). Understanding these dynamics dereplication and analysis of secondary metabolism genes in fungi. and describing the diversity of NRPSs (and other secondary Focusing on the secondary metabolite rich genus Aspergillus, we metabolites) throughout the genus Aspergillus will lead the categorize SMGCs across genomes into SMGC families using net- way for new pharmaceutical drugs. work analysis. Our method elucidates the diversity and dynamics Prediction pipelines such as SMURF (12) and antiSMASH of secondary metabolism in section Nigri, showing that SMGC diver- (13) facilitated the mining of genomic sequences for SMGC. sity within the section has the same magnitude as within the genus. Studies using these algorithms on fungal genomes revealed an Furthermore, we show how these families can be used to mine for amount of SMGCs exceeding expectations based on chemical gene clusters responsible for the production of specific compounds. analysis by far (14) — suggesting the biosynthetic potential Using our genome analysis we were able to predict the gene cluster of fungi is larger than anticipated. To efficiently analyse these of the potential anti-cancer compound malformin in 18 strains. To large datasets over several organisms, genome neighbourhood proof the general validity of our predictions, we developed genetic networks have been used previously in bacteria to predict new engineering tools in Aspergillus brasiliensis and subsequently veri- gene clusters and ease strain prioritization for polyketides of fied the genes for biosynthesis of malformin. interest (15). In this study, we used comparative genomics and network analysis to describe the dynamics and diversity Aspergillus comparative genomics genetic dereplication natural | | | of annotated and non-annotated SMGCs of 26 species of the products malformin | SM-rich Aspergillus section Nigri (6) and five reference species. Identifying homologous gene clusters on the isolate, clade and he genus Aspergillus is one of the best studied fungal section level enabled us to define groups with similar SMGC Tgenera, with important species in the industrial, food and medical sector as well as in basic research. Its diverse reper- toire of bioactive SMs e.g. anti-cancer compound enhancing malformins, cholesterol-lowering statins, and the mycotoxin Significance Statement aflatoxin have been detected in numerous analyticalDRAFT studies (6) Fungi harbor a vast amount of secondary metabolites that are — with many SMs applied primarily in the medical industry of great interest for society. Linking genes to metabolites, how- (7). ever, is a laborious task. Analyzing multiple genome sequences SMs are synthesized by different classes of enzymes. In and comparing them on a genus level enables us to identify fungi, these are polyketide synthases (PKS), non-ribosomal genes responsible for these metabolites, and describe evolu- peptide synthetases (NRPS), terpene cyclases (TC), dimethy- tionary patterns and dynamics. Using a network approach, lallyl transferases (DMATS), enzymes only consisting of a we make the algorithm reliable and scalable to more species smaller subset of modules (PKS-Likes, NRPS-Likes), and fu- and genomes, allowing fast expansion of the analysis. As an sions of PKS and NRPS (PKS-NRPS/ NRPS-PKS hybrids). example, we link metabolic data to evolutionary patterns and These enzymes produce a SM backbone which is further modi- directly identify the synthetase responsible for the anticancer fied by tailoring enzymes. The collective of enzymes necessary agent malformin. for production of a SM is encoded by a gene cluster.

NRPSs constitute a major group of secondary metabolite en- S.T., M.R.A., T.V., J.B.N., T.O.L., designed research; S.T., M.R.A., T.V., J.K.R., J.B.N., K.F.N., R.R., zymes and can utilize L-amino acids, as well as non-proteogenic A.S., L.A. performed research; S.T., T.V., J.B.N., analyzed data; and S.T., M.R.A., J.B.N., J.K.R. amino acids as their substrate (8), creating a diverse port- wrote the paper. folio of compounds. Domains inside NRPSs are adenylation The authors declare no conflict of interest. domains (A) for loading of amino acids, thiolation (T) do- 1M.R.A. (E-mail: [email protected]) is corresponding author on the computational analysis and J.B.H. mains for peptide chain transfer, condensation domains (C) is corresponding author on the experimental work ([email protected]) 82 www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX PNAS | February 1, 2018 | vol. XXX | no. XX | 1–10 content inside section Nigri. few different SMGCs — although none of them have unique As an extension of our approach, we demonstrate the use SMGCs. The shared SMGC content among species varied of SMGC families together with information on metabolite depending on the clade. Species in the A. niger and A. tubin- profiles to mine for the gene cluster responsible for malformin gensis clade share 60–80% — with A. eucalypticola showing a biosynthesis. Malformins, a major group of compounds abun- distinct SMGC composition inside the clade. Species in the dant in section Nigri (16), show anti-tobacco mosaic virus A. carbonarius clade share 60–70% of SMGC families. This activity (17) and act as potentiator of anti-cancer drugs in similarity dropped to 50–60% inside the A. heteromorphus mouse and human colon carcinoma cells (18). Identifying the clade. Most uniseriates shared at least 60% SMGCs, with SMGC responsible for malformin biosynthesis will allow for the exception of A. saccharolyticus and A. homomorphus only optimization of native as well as heterologous gene expression. sharing as few as 40% of their SMGC families with other mem- Our approach successfully predicts the malformin gene cluster bers of the uniseriates. On a section level, we can show that in multiple Aspergillus species, unveiling the feasibility of per- biseriates and uniseriates (apart from the A. heteromorphus forming large scale dereplication of homologous gene clusters clade) each show a SMGC family inter-clade similarity of at using collections of genome-sequenced strains. least 40%. Comparing the A. tubingensis and A. niger clade to uniseriates the SMGC similarity is 10–30% — the same as Results between the section Nigri and the reference species. Hence, we can determine that the diversity of secondary metabolites Creating families of secondary metabolite gene clusters. In inside section Nigri is similar to the diversity seen across the order to describe the secondary metabolite gene cluster diver- genus as a whole. sity of section Nigri, we analyzed genomes of 32 Aspergilli plus From this, it can be inferred that section Nigri must have the reference species A. oryzae, A. flavus, A. fumigatus, A. undergone a substantial gain and loss of secondary metabolite nidulans and Penicillium chrysogenum. Aspergilli are known genes. In species which show a larger difference in SMGC to produce similar SMs throughout species (6, 19, 20), thus composition to closely related species — as in the case of A. including these species would ensure to relate SMGC content eucalypticola — suggests horizontal gene transfer from outside to phylogeny and potentially reveal homologous gene clusters. section Nigri or retention of SMGC. Surprisingly, a small All 37 genome sequences were analyzed using our previously amount of SMGCs seem to be retained in the whole genus established pipeline (21). The pipeline creates families of since we find at least 10% similarity of SMGC families between homologous gene clusters using local alignment of protein the species included in the analysis. Additionally, a maximum sequences of cluster members. It retains bidirectional hits that of 30% shared SMGC families between distantly related species are above the coverage cutoff and uses the percent identity exceeds the SMGC similarity previously anticipated in the to compute a similarity score for each query cluster to each genus Aspergillus (22). hit cluster. Subsequently, the similarity scores are used to In conclusion, we can confirm that the clustering of SMGCs create a network of all SMGCs and random walk clustering into families reflects the SM distribution of species in analyt- is used to create families of SMGCs. Using the pipeline on ical studies. The SMGC similarity over large phylogenetic the dataset, we detected 2,717 gene clusters and categorized distances suggests analogous pathways in the same family. them into 455 families — groups predicted to produce similar compounds based on homologous gene clusters — of which Coupling MIBiG annotation to SMGC families automates ge- 245 only contain one cluster and are therefore unique (21). netic dereplication of compounds. With the diversity of SMGCs established through our dataset, we were interested Comparative genomics reveals secondary metabolite gene in the presence of gene clusters linked to known compounds cluster diversity on several taxonomic levels in section Nigri. through section Nigri. With an increasing number of available With the establishment of SMGC families in related species, we fungal genome sequences, we see an identification of known were interested whether SMGC content is reflected in species, compounds by genomic methods as crucial, since laboratory clade, section and genus circumscription. Differences in SM conditions might not reveal the full metabolite profile of a content have been shown previously for the groupsDRAFT of uniseri- fungus. Furthermore, it may help to avoid experiments re- ates (species with phialides attached directly to the vesicle) identifying the same gene cluster in multiple species (similar and biseriates (species with metulae between phialides and to the process known as metabolite dereplication in chemical vesicle) inside section Nigri (Fig.1,( 6)). Hence, differences analysis (23)). To achieve this, we used 1461 gene clusters of in SM production should be the result of different SMGCs the Minimum Information about a Biosynthetic Gene cluster present in species. (MIBiG) database (24) to identify known compounds with Comparing all shared SMGCs throughout species in the characterized SMGCs in our SMGC families and determine dataset, we established a dendrogram based on the similarities related compounds. This is of special interest for mycotoxins of SMGCs between fungi (Fig.1), which to a large extent and compounds with medical applications. reflects the phylogenetic tree of Aspergilli (left side of Fig.1 Using protein BLAST (25), we identified 34 best hits found panel A ; (21)) with subtle differences. The heatmap in figure in our SMGC families for compound-linked gene clusters re- 1 provides further insight into SMGC similarity of species. trieved from MIBiG. Since SMGC families represent groups Clustering the organisms by their SMGC families shows a of homologous and related gene clusters, we can identify the clear distinction between biseriate species, uniseriate species SMGC family of the hit as a related gene cluster producing and reference species. Inside these groups, further subgroups a similar compound by using a guilt by association approach. can be identified. Hence, new genomes can be analyzed and added with infor- A. niger isolates shared 80–100% of SMGC families — mation on their secondary metabolite production capabilities. pointing out that isolates of the same species can carry a The associated compounds and presence patterns of gene clus- 83 2 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. ters are shown in Fig.1B. TAN-1612 (a neuropeptide Y antagonist (34)) can be found Of the 34 known gene clusters used to annotate the SMGC in the same SMGC family. The gene cluster in A. nidulans is families in the dataset, two gene clusters linked to the com- producing asperthecin, while the gene clusters predicted in the pounds fungisporin, YWA and one gene cluster linked to the A. niger and A. tubingensis clades are likely producing TAN- siderophore ferrichrome were found in all species of the dataset 1612 since the uniform presence in these two clades suggests (21). This illustrates that we can detect homologous gene clus- the gene cluster to be conserved throughout species (Fig.1B, ters over the genus as similarly shown by the shared gene supplementary file 1). This highlights further that our method cluster overview (Fig.1). 14 gene clusters were only found in can be used to mine for similar compounds in SMGC families. the reference species. A gene cluster predicted in the A. niger clade and A. tubingensis is highly similar to the pseurotin A gene cluster, a SMGCs for different heteroisoextrolites based on 6-MSA are found in chitin synthase inhibitor (35) from A. fumigatus (36), where section Nigri. Aspergili in distinct sections are known to produce it forms a super cluster together with fumagillin (37). functionally similar types of secondary metabolites, also called Uniseriates contain a related gene cluster to A. flavus (39) heteroisoextrolites (20). These heteroisoextrolites are based and A. oryzae (40) responsible for the production of food on analogous biosynthetic pathways which we successfully contaminant and mycotoxin cyclopiazonic acid (CPA). This annotated in gene cluster families. further confirms our findings that a number of SMGCs can Using our automated method, we were able to de- be shared over large phylogenetic distances (Fig.1). It is tect SMGCs for heteroisoextrolites that are based on 6- surprising that the gene cluster is also found in A. saccharolyti- methylsalicylic acid (6-MSA), in particular the antifungal cus and A. heteromorphus since they differ in their SMGC patulin (2, 26) and the antimicrobial yanuthone D (3, 27). As- content from the rather SM homogeneous rest of uniseriates sociation of the patulin gene cluster with its family identifies (Fig.1). Kato et al. ( 41) highlighted that CPA is produced 11 patulin-like gene clusters in uniseriates. Further inspection but converted in A. oryzae, so it remains to be answered if the of the family revealed the cluster responsible for aculinic acid uniseriate species produce CPA or a derivate thereof. A gene in A. aculeatus which is shown to be highly similar in genetic cluster unique to uniseriates is predicted to be responsible content and function to the patulin gene cluster (28). Simi- for aculeacin A production. Aculeacin A is an interesting larly, gene clusters primarily found in the A. niger clade, as lipopeptide inhibiting β-glucan synthesis (42). well as in Penicillium chrysogenum are predicted to produce All in all, our analysis can automatically identify related the antifungal 6-MSA-based compound yanuthone (27, 29). SMGCs over a large set of species. However, our analysis also The network plot in Fig. S1 shows how the related clusters highlights that genetically dereplicated SMGC only consti- were divided into families and highlights how SMGC networks tute a small fraction of the secondary metabolites potentially can be used to classify related SMGCs. produced by Aspergilli.

SMGCs related to the trypacidin gene cluster can be found in unis- Mining for gene clusters in SMGC families reveals candidates eriate species. Another group of heteroisoextrolites present for the malformin gene cluster in 18 strains. To address the throughout Aspergilli are compounds based on emodin, such large interest in discovery of novel biosynthetic gene clusters as geodin and the mycotoxins trypacidin and secalonic acid for specific compounds, we wanted to link SMGC families to (20). Secalonic acid is of special interest due to a wide range compounds of interest. For this, we focused on malformin- of bioactivities (1, 4). We identified a SMGC family contain- producing species: A. niger, A. brasiliensis, and A. tubingensis ing the monodictyphenone/emodin gene cluster (30) from A. (6). Malformin is interesting as it is both a potential com- nidulans, the trypacidin gene cluster from A. fumigatus (31) pound for medical application in cancer treatment (18) and is and several unlinked gene clusters in uniseriates. Uniseriate produced under laboratory conditions. species are known to produce secalonic acid (19, 32), hence The first criterium for a malformin-producing gene cluster we predicted the unlinked gene clusters in this family to be was set to a gene cluster being present in all producing species. responsible for secalonic acid production. The presence in Since the species mentioned above share 50% or more of their A. brasiliensis is surprising, but according to the analysis of DRAFTSMGC families, more levels of filtering are required to narrow shared SMGC (Fig.1), it is likely that it acquired or retained down candidates. The second criterium considered the size SMGCs from other species and therefore stands out of the A. of candidate NRPSs. Malformin is a pentapeptide consisting niger and A. tubingensis clade. of Val, D-Leu, Ile and two D-Cys amino acids. Hence, it Genetic dereplication predicts a gene cluster for azanigerone-like was anticipated the NRPS would consist of five modules with compounds in 17 strains. Previous studies identified the silent a length of approximately 18kb. As we could not find a azanigerone gene cluster by overexpression of cluster genes in hit with these parameters, we moderated our search to four Aspergillus niger ATCC 1015 (33). In our analysis, we can iden- modules under the hypothesis that one of the modules is tify an azanigerone-like producing gene cluster in biseriates, iterative (as seen in other studies (43)), resulting in a size and A. homomorphus — which is uniseriate (supplementary of approximately 12kb. The third criterium considered the file 1). This further highlights our algorithm as an important annotation of tailoring enzymes present within the cluster. addition to chemical analysis, since genetic dereplication is Tailoring enzymes should contain disulphide bond associated able to identify gene clusters over a large set of genomic se- enzymes to be able to create the disulphide bond included in quences, even though they may be silent in the hosts under malformins (Fig.4 D). normal conditions. Using our search criteria, we found a single candidate NRPS gene cluster matching all of our criteria. The SMGC family SMGC families contain communities of related SMGC. Gene clusters shows a high level of synteny and all gene clusters inside the producing the structurally related polyketides asperthecin and family show the same NRPS size and disulphide bond creating 84 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 3 enzymes (Fig.2). Inside the cluster, we found genes coding for production (Fig.4 C). major facility superfamily transporters, methyltransferases and a transcription factor — activities common to SMGCs. The Discussion gene cluster family was curated by removing four gene clusters shown in Fig. S2, which only aligned to the extended part With the first Aspergillus genome sequences available (51, 52), of the predicted cluster and do not contribute to malformin research on natural products and on the evolution of secondary synthesis. metabolism in fungi experienced a paradigm shift (53, 54). Now with whole genus sequencing projects, these fields are Predicting condensation domain function in MlfA. To further experiencing a second shift due to the amount of generated confirm the candidate cluster, we created a maximum likeli- data. This study fills the gap for automated and scalable multi- hood phylogeny of NRPS condensation domains with known species dereplication and classification of secondary metabolite functions in our dataset ( fungisporin/ nidulanin A (5, 44, 45), gene clusters using similarity networks. Specific to our study fumiquinazolines (46), fumitremorgin/ brevianamide (36) and is the mapping of phylogeny on SMGC families which then penicillin (47)), including condensation domains of the pre- enables a relation of phylogeny to SMGC content. Hence, dicted malformin synthetase MlfA to predict their functions we can specifically determine the amount of SMGC shared (Fig.3). between certain species. According to the amino acid composition of malformin, we Using SMGC similarity networks and random walk clus- expected epimerization, epimerizing D-L joining condensation tering we grouped SMGCs across 37 genomes into families domains and a cyclizing condensation domain to be present of gene clusters. Remarkably, we found that species on the in the synthetase. From branches in the phylogeny, we can isolate level can carry a few distinct SMGC as shown by a predict the functions of the five condensation domains in similarity of 80-90% between A. niger isolates. Previous stud- malformin to be DL-joining (epimerizing subtype), LL-joining, ies have shown that the A. niger group is prone to undergo epimerization, DL-joining and cyclizing domain, thus matching extensive transfer and loss of genetic elements, also reflected in the expectations and supporting the identification of the NRPS production of different exometabolites (55). Species of distinct as involved in malformin production. clades inside section Nigri can share 30–80 % depending on the distance of the species, showing a diversity similar to Peni- Genetic and chemical analysis verifies mlfA prediction. To cillia clades as indicated recently (56). Gene cluster families verify the genetic assignment, we first had to develop ge- over different clades are of special interest since they harbour netic engineering tools in A. brasiliensis. We decided to con- similar compounds as heteroisoextrolites (20). Examples in struct a pyrG∆ strain in order to subsequently generate a this study include SMGC based on emodin as e.g. trypacidin, non-homologous end-joining deficient strain — facilitating effi- secalonic acid and SMGC based on 6-MSA that we could link cient gene targeting (48). We employed a clustered regularly to yanuthone-D, aculinic acid and patulin throughout differ- interspaced short palindromic repeats associated endonuclease ent species of the dataset. 6-MSA based heteroisoextrolites 9 (CRISPR-Cas9) system (49) to induce a double-strand break seem to be common in Trichocomaceae since Penicillia contain (DSB) in pyrG resulting in uridine auxotrophy. Subsequent yanuthone producing gene clusters as well (56). The most sequencing of three candidates confirmed that strain 1 had an phylogenetically distinct clades exhibited a surprising amount out-of-frame mutation in the region corresponding to the proto- of diversity in their SMGC content. In comparisons between spacer via a 16-nucleotide deletion (nucleotides number 45-60, uniseriate and biseriate species, we could show a SMGC simi- allele name pyrG1 ) within pyrG. In this strain, we utilized the larity of 10-20%. We gained similar results in comparisons of CRISPR-Cas9 system to induce a DSB at the akuA locus while Nigri species with reference species. Thus, the SMGC diver- supplying a repair template in form of a linear gene-targeting sity within the section has the same magnitude as within the substrate for akuA. A correct homokaryotic transformant was whole genus Aspergillus. This finding is in accordance with verified as an akuA deletant by diagnostic tissue polymerase analytical studies characterizing biseriates and uniseriates as chain reaction (PCR) (Fig. S3). The strain, akuA∆::AFLpyrG, distinct groups based on their metabolite production (6). On was screened on 5-fluoroorotic acid (5-FOA) enrichedDRAFT growth a genus level, the amount of shared SMGC families over large medium for loss of AFLpyrG by the lack of ability to grow phylogenetic distances is higher in our study than estimated without supplementation of uridine in the medium and di- by Lind et al. (22). agnostic PCR. In the resulting pyrG-free strain, akuA∆, we As a result of our analysis, we were able to predict the targeted the NRPS encoded by Aspbr1_34020, which based malformin gene cluster in 18 strains and confirmed it in A. on the predictions above was the best candidate for malformin brasiliensis. Our results are in accordance with reports of pro- production. Six transformants were streak-purified and PCR ducing strains as mentioned by Nielsen et al. and Vesth et al. analyzed, resulting in two homokaryotic deletion mutants of (6, 21). Surprisingly, the predicted NRPS consists of only four mlfA (see Fig. S3). Both strains were subsequently screened, modules which did not match our assumptions of an NRPS alongside akuA∆::AFLpyrG as reference, for their ability to structure for five amino acids. Examples of NRPS following produce malformin A and C after seven days of cultivation different synthesis schemes, however, are increasing. Enniatin, on yeast extract sucrose (YES) solid growth medium. The enterobactin, bacillibactin and gramicidin S are synthesized by deletion of Aspbr1_34020 (mlfA∆) showed a total abolishment iterative NRPSs (9). Identification of tailoring enzymes coding of malformin production (Fig.4 C). for disulphide-bond associated functions and establishment of Moreover, a genetic complementation by a constitutively a condensation domain model for the predicted gene cluster/ expressed mlfA in the mlfA∆ strain re-established malformin synthetase helped us to further sustain our prediction. A phy- production with the same adduct pattern as for the refer- logenetic characterization of condensation domains was chosen ence strain, thus confirming the role of mlfA in malformin over prediction tools since these tools were built on bacterial 85 4 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. data (57). Upon deletion of mlfA, malformin production is more members than available organisms the clustering was repeated abolished. Furthermore, we were able to show that comple- with the family to find further subcommunities. mentation of mlfA∆ strains with mlfA can revive production of Visualization of shared SMGC families. A heatmap containing hirar- malformins. In combination, this makes us confident that mlfA chically clustered column dendrograms was created using the gplots is coding for a NRPS responsible for malformin production. package (64) and a phylogenetic tree imposed on rows (65). We hypothesize the NRPS to act iteratively on one amino acid and possesses multiple amino acid specificities since multiple Mining for malformin producing NRPS. Created SMGC families were malformins disappear after deletion of the NRPS (Fig.4). classified as potential producers of malformin according to three criteria. Strains which are known to produce malformin (A. niger Our study shows that SMGC similarity networks and fami- CBS 513.88, A. niger NRRL3, A. niger ATCC 1015, A. brasiliensis, lies are ideal constructs for guilt by association based derepli- and A. tubingensis) should be included in the family; the clusters cation and genome mining for SMs of interest. This also shows should include an NRPS of at least 14400 nucleotides and tailor- the advantage of using an aggregated similarity score per gene ing enzymes should include the terms ’glutathione’ or ’disulphide’. From the two resulting families, the best hit was used for further cluster as opposed to one comparison per domain, since only investigation. Gene clusters were visualized using Gviz (66). one SMGC will be associated with a compound (56). An- other advantage of our algorithm is to genetically dereplicate Whole genome phylogeny. A whole genome phylogenetic tree was SMGCs which would go unnoticed with existing methods. We generated to compare phylogeny to hirarchical clustering based were able to identify homologs of a gene cluster in 17 strains, on secondary metabolite family content. The phylogeny was con- structed using 200 bidirectional best hits between species. These which is silent in the original host under laboratory condi- best hits were concatenated and aligned using MAFFT (67) and tions (33). Hence, genetic dereplication uncovers new sets of conserved blocks extracted using Gblocks (68). A maximum like- SMGCs as targets for heterologous expression that might not lihood phylogeny was created using the trimmed alignments for be discovered by, e.g. OSMAC (58). Our study also shows multithreaded RAxML with PROTGAMMAWAG model and 100 bootstraps (69). that the amount of dereplicated SMGCs in section Nigri is far lower than the total amount of SMGCs (21). Hence this Prediction of condensation domain types. Condensation domains study will facilitate further efforts to investigate the SMGCs were extracted from protein sequences using annotations from Inter- of section Nigri. proScan5 (70). Separated domains, i.e. domains smaller than 350 Genome sequencing projects are extending in their amount amino acids and less than 100 amino acids apart from a domain of the same type, were merged before proceeding. Resulting domain of data and are in need of automated methods for derepli- sequences were aligned using Clustal Omega (71) and trimmed using cation and characterization (14, 21, 56, 59, 60). This study trimal (72) retaining sequences with over 65 % residue coverage presents a pipeline that will cope with the challenges of in- in over 80 % of sequences and removing all columns with gaps in creasing genomic datasets and pinpoint towards related SMGC more than 20 % of sequences with similarity lower than 0.001 but for synthetic biology approaches. With the relational char- preserving at least 60 % of columns. IQ-tree (73) was used on aligned sequences using a LG+F+I+G4 substitution model (74) acterization of a whole section of species at once, our study and 1000 times bootstrap (75). Functions of condensation domains will serve as a resource for future analyses of SMGCs and of the predicted malformin NRPS could be assigned by coclustering SMs in Aspergilli. Especially similarity networks and guilt by with known examples. association dereplication of genome data will facilitate genome Annotating SMGC families using MIBiG. Gene cluster annotations mining efforts. were downloaded from the MIBiG database (24) and 1461 sequences Finally, similarity networks of SMGCs prove to serve for the of backbone proteins extracted using biopython (38). Protein se- genetic dereplication of SMGCs in several species and establish quences were then blasted against our dataset. Hits reaching a their phylogenetic distribution. Assessing and categorizing the percent identity, query coverage and hit coverage of over 95% were metabolic potential of species in this automated manner will retained to find best hits in our dataset. Corresponding SMGC families were annotated as related cluster of the hit. greatly facilitate the discovery of new relevant SMs. Construction of mutant strains. The wild type culture (WT) A. Materials and Methods brasiliensis (CBS 101740/IBT 21946) (76) was used, to generate a uridine requiring pyrG- strain (pyrG1, BRA6), and from BRA6, a DRAFTknockout strain of the Ku70 homolog akuA was created to enable efficient gene targeting (48), see Table S1 for strains. Genomic DNA Data collection. A customized version of SMURF (12) was used (gDNA) from WT A. brasiliensis was isolated via FastDNA SPIN to annotate secondary metabolite gene clusters throughout draft Kit for Soil DNA extraction kit (MP Biomedicals, USA). Aspergillus genomes. Protein sequences, smurf annotations, interpro All A. brasiliensis strains were cultivated at 30°C on minimal annotations and gff files were obtained from jgi (https://genome.jgi. medium (MM), supplemented with 10 mM uridine if required for doe.gov/). growth. The MM, transformation media (TM) and media for Draft genomes of 32 Aspergilli of section Nigri and five other pyrG counter-selection (MM+5-FOA) were prepared as described Aspergillus genomes were analyzed using our previously established in (49). All transformations employing CRISPR/Cas9 vectors used pipeline to study their secondary metabolite gene cluster diversity hygromycin B (100 µg/ml, Invivogen) for selection. Yeast extract (21). sucrose (YES, (77)) growth media was used for chemical analysis. Chemical competent Escherichia coli DH5α were applied for vector Creation of SMGC families. Families of gene clusters were created assembly and plasmid propagation at 37°C, and E. coli cultivations with the pipeline established in (21). BLAST+ (25) was used to find were carried out in Lumia Broth (LB) media (1% Bacto tryptone, bidirectional hits between all secondary metabolite proteins using 0.5% Bacto yeast extract, 1% NaCl, pH 7.0) supplemented with an E-value of 1e-10, at least 50% identity and a sum of coverage of 0.1% ampicillin. All solid media were supplied with 2% agar. 130% as cutoffs. Subsequently, the identity values were aggregated from query to hit clusters to create a cluster similarity score (21). Secondary metabolite extraction and analysis. Extraction of sec- The established connections were then used to create a network of ondary metabolites from solid media (CYA and YES) 6 plugs secondary metabolite gene cluster proteins (61) and random walk (6 mm) were based on samples across the radius of the fungal clustering (62) in R (63) with 1 step was used to find families of colony, transferred to a microcentrifuge tube and covered in ethyl homologous/ related gene cluster proteins. In case a family carried acetate/2-propanol 3:1(v/v) with 1% (v/v) formic acid for 60 min 86 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 5 ultrasonication. The extraction solvent was transferred to a clean peptide produced by an endophytic fungus Aspergillus tubingensis FJBJ11. International vial, solvents evaporated using N2 flow, and the residues on the Journal of Molecular Sciences 16(3):5750–5761. tube walls were re-dissolved in methanol for 30 min by ultrasonica- 18. Wang J, et al. (2015) Study of malformin C, a fungal source cyclic pentapeptide, as an anti- tion. The samples were centrifuged at 15000 g and the supernatant cancer drug. PLoS ONE 10(11):1–19. transferred to a HPLC auto sampler vial. UHPLC-DAD-QTOFMS 19. Samson RA, et al. (2007) Diagnostic tools to identify black aspergilli. Studies in mycology 59:129–45. was performed on an Agilent Infinity 1290 UHPLC system equipped 20. Frisvad JC, Larsen TO (2015) Chemodiversity in the genus Aspergillus. Applied Microbiology with a diode array detector. Separation was done on a 250 × 2.1 and Biotechnology 99(19):7859–7877. mm i.d., 2.7 µm, Poroshell 120 Phenyl Hexyl column (Agilent Tech- 21. Vesth TC, et al. (2018) The genomes of Aspergillus section Nigri reveal drivers in fungal nologies, Santa Clara, CA) held at 60°C. Subsamples of 1 µL, were speciation (submitted). Submitted to Nature. eluted with a flow rate of 0.35 mL/min using A: water with 20 mM 22. Lind AL, et al. (2015) Examining the Evolution of the Regulatory Circuit Controlling Sec- formic acid and B: acetonitrile with 20 mM formic acid as a gradient ondary Metabolism and Development in the Fungal Genus Aspergillus. PLOS Genetics system starting at 90% A, which linearly dropped to 10% in 15 min, 11(3):e1005096. and held for 2 min before returning to 90% for 2 min. Acetonitrile, 23. Klitgaard A, et al. (2014) Aggressive dereplication using UHPLC-DAD-QTOF: screening ex- tracts for up to 3000 fungal secondary metabolites. Analytical and bioanalytical chemistry methanol, ethyl acetate, 2-propanol and formic acid were analytical 406(7):1933–43. grade (Sigma-Aldrich, St. Louis, MO, USA). Water, acetonitrile and 24. Medema MH, et al. (2015) Minimum Information about a Biosynthetic Gene cluster. Nature formic acid for MS solvents were all LC-MS grade (Sigma-Aldrich). Chemical Biology 11(9):625–631. Mass spectrometry (MS) detection was performed on an Agilent 25. Camacho C, et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics 6545 QTOF MS equipped with an Agilent dual jet stream ESI 10(1):421. operated in ESI+ mode, with MS spectra recorded as centroid data, 26. Tannous J, et al. (2014) Sequencing , physical organization and kinetic expression of the at an m/z of 100 to 1,700, and auto MS/HRMS fragmentation was patulin biosynthetic gene cluster from Penicillium expansum. International Journal of Food performed at three collision energies (10, 20, 40 eV), on the three Microbiology 189:51–60. 27. Petersen LM, et al. (2015) Characterization of four new antifungal yanuthones from As- most intense precursor peaks per cycle. The acquisition was 10 pergillus niger. The Journal of antibiotics 68(3):201–5. spectra/s. Data were treated in Agilent MassHunter Qualitative 28. Petersen LM, et al. (2015) Investigation of a 6-MSA Synthase Gene Cluster in Aspergillus ac- Analysis, and compounds were detected using extracted ion chro- uleatus Reveals 6-MSA-derived Aculinic Acid, Aculins A-B and Epi-Aculin A. ChemBioChem matograms (EICs) ± m/z 0.005 Da of the theoretical masses (78). 16(15):2200–2204. MSHRMS were evaluated against a database of 1500 compounds, 29. Holm DK, et al. (2014) Molecular and chemical characterization of the biosynthesis of the while HRMS and MS/HRMS peaks were matched against around 6-MSA-derived meroterpenoid yanuthone D in Aspergillus niger. Chemistry and Biology 3000 known and suspected Aspergillus compounds. Reference stan- 21(4):519–529. dards of malformins C and A were co-analysed in the sequence. 30. Chiang YM, et al. (2010) Characterization of the aspergillus nidulans monodictyphenone gene cluster. Applied and Environmental Microbiology 76(7):2067–2074. Malformin A2 (C22H37O5N5S2) and C (C23H39O5N5S2) were de- 31. Mattern DJ, et al. (2015) Identification of the antiphagocytic trypacidin gene cluster in the + + + + tected at using EICs of expected adducts ([M H] , [M NH4] , human-pathogenic fungus Aspergillus fumigatus. Applied microbiology and biotechnology + + [M Na] ) based on the calculated monoisotopic mass [M], 515.2236 pp. 10151–10161. Da and 529.2393 Da. 32. Fungaro MHP, et al. (2017) Aspergillus labruscus sp. nov., a new species of Aspergillus sec- tion Nigri discovered in Brazil. Scientific Reports 7(1):1–9. 33. Zabala AO, Xu W, Chooi YH, Tang Y (2012) Discovery and Characterization of a Silent Gene Cluster that Produces Azaphilones from Aspergillus niger ATCC 1015 Reveal a Hydroxylation- ACKNOWLEDGMENTS. We thank Martin Engelhard Kogle and Mediated Pyran-Ring Formation. Chemistry & biology 19(8):1049–59. Ellen Kirstine Lyhne for gDNA preparation of Aspergillus strains. 34. Kodukula K, et al. (1995) BMS-192548, a tetracyclic binding inhibitor of neuropeptide Y re- ST, TV, and MRA gratefully acknowledge support from the Villum ceptors, from Aspergillus niger WB2346. I. Taxonomy, fermentation, isolation and biological Foundation, grant VKR023437. We thank Prof Adrian Tsang for activity. The Journal of antibiotics 48(10):1055–9. making the A. niger NRRL3 genome available. 35. Wenke J, Anke H, Sterner (1993) Pseurotin A and 8-O-Demethylpseurotin A from Aspergillus fimugatus and theirs inhibitory activities osn chitin synthase. Bioscience, Biotechnology, and Biochemistry 57(6):961–964. 1. Hu et al.(2012Secalonic acid D reduced the percentage of side populations by down- 36. Maiya S, Grundmann A, Li SM, Turner G (2006) The fumitremorgin gene cluster of Aspergillus regulating the expression of ABCG2 Biochemical Pharmacology 85(11):1619–1625. fumigatus: Identification of a gene encoding brevianamide F synthetase. ChemBioChem 2. Iwahashi et al.(2006)Mechanisms of patulin toxicity under conditions that inhibit yeast growth 7(7):1062–1069. Journal of Agricultural and Food Chemistry 54(5):1936–1942. 37. Wiemann P, et al. (2013) Prototype of an intertwined secondary-metabolite supercluster. Pro- 3. Bugni et al. (2000)Yanuthones: Novel metabolites from a marine isolate of Aspergillus niger ceedings of the National Academy of Sciences 110(42):17065–17070. Journal of Organic Chemistry 65(21):7195–7200. 38. Cock, P, et al. (2009)Biopython: Freely available Python tools for computational molecular 4. Zhai et al.(2013)Secalonic acid A protects dopaminergic neurons from 1-methyl-4- biology and bioinformatics. 11(25):1422–1423 phenylpyridinium (MPP+)-induced cell death via the mitochondrial apoptotic pathway Eu- 39. Varga J, Baranyi N, Chandrasekaran M, Vágvölgyi C, Kocsubé S (2015) Mycotoxin producers ropean Journal of Pharmacology 713(1–3):58–67. in the Aspergillus genus: An update. Acta Biologica Szegediensis 59(2):151–167. 5. Andersen et al.(2013)Accurate prediction of secondary metabolite gene clusters in filamen- 40. Tokuoka M, et al. (2008) Identification of a novel polyketide synthase-nonribosomal pep- tous fungi. Proceedings of the National Academy of Sciences of the United States of America tide synthetase (PKS-NRPS) gene required for the biosynthesis of cyclopiazonic acid in As- 110(9):E99–E107. pergillus oryzae. Fungal Genetics and Biology 45(12):1608–1615. 6. Nielsen KF, Mogensen JM, Johansen M, Larsen TO, Frisvad JC (2009) Review of secondary 41. Kato N, et al. (2011) Genetic Safeguard against Mycotoxin Cyclopiazonic Acid Production in metabolites and mycotoxins from the Aspergillus niger group. Analytical and Bioanalytical Aspergillus oryzae. ChemBioChem 12(9):1376–1382. Chemistry 395(5):1225–1242. DRAFT 42. Hector RF (1993) Compounds active against cell walls of medically important fungi. Clinical 7. Martínez-Núñez MA, et al. (2016) Nonribosomal peptides synthetases and their applications microbiology reviews 6(1):1–21. in industry. Sustainable Chemical Processes 4(1):13. 43. Juguet M, et al. (2009) An Iterative Nonribosomal Peptide Synthetase Assembles the 8. Finking R, Marahiel MA (2004) Biosynthesis of Nonribosomal Peptides. Annual Review of Pyrrole-Amide Antibiotic Congocidine in Streptomyces ambofaciens. Chemistry and Biology Microbiology 58(1):453–488. 16(4):421–431. 9. Mootz HD, Schwarzer D, Marahiel MA (2002) Ways of assembling complex natural products 44. Klitgaard A, Nielsen JB, Frandsen RJN, Andersen MR, Nielsen KF (2015) Combining Sta- on modular nonribosomal peptide synthetases. ChemBioChem 3(6):490–504. ble Isotope Labeling and Molecular Networking for Biosynthetic Pathway Characterization. 10. Fischbach MA, Walsh CT, Clardy J (2008) The evolution of gene collectives: How natural Analytical Chemistry 87(13):6520–6526. selection drives chemical innovation. Pnas 105(12):4601–4608. 45. Ali H, et al. (2014) A non-canonical NRPS is involved in the synthesis of fungisporin and 11. Bushley KE, Turgeon BG (2010) Phylogenomics reveals subfamilies of fungal nonribosomal related hydrophobic cyclic tetrapeptides in Penicillium chrysogenum. PLoS ONE 9(6). peptide synthetases and their evolutionary relationships. BMC evolutionary biology 10(1):26. 46. Gao X, et al. (2012) Cyclization of fungal nonribosomal peptides by a terminal condensation- 12. Khaldi N, et al. (2010) SMURF: Genomic mapping of fungal secondary metabolite clusters. like domain. Nature chemical biology 8(10):823–30. Fungal genetics and biology : FG & B 47(9):736–41. 47. Diez B, Ii V, Martin JF, Barredosll JL (1990) The Cluster of Penicillin Biosynthetic Genes. 13. Medema MH, et al. (2011) antiSMASH: rapid identification, annotation and analysis of sec- Biochemistry 265(27):16358–16365. ondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. 48. Nielsen JB, Nielsen ML, Mortensen UH (2008) Transient disruption of non-homologous end- Nucleic acids research 39(Web Server issue):339–46. joining facilitates targeted genome manipulations in the filamentous fungus Aspergillus nidu- 14. de Vries RP, et al. (2016) Comparative genomics reveals high biological diversity and specific lans. Fungal genetics and biology : FG & B 45(3):165–70. adaptations in the industrially and medically important fungal genus Aspergillus. pp. 1–45. 49. Nødvig CS, Nielsen JB, Kogle ME, Mortensen UH (2015) A CRISPR-Cas9 system for genetic 15. Rudolf JD, Yan X, Shen B (2016) Genome neighborhood network reveals insights into engineering of filamentous fungi. PLoS ONE 10(7):1–18. enediyne biosynthesis and facilitates prediction and prioritization for discovery. Journal of 50. Chung BKW, Yudin AK (2015) Disulfide-bridged peptide macrobicycles from nature. Organic Industrial Microbiology and Biotechnology 43(2-3):261–276. & Biomolecular Chemistry 13(33):8768–8779. 16. Yukioka M, Winnick T (1966) Synthesis of malformin by an enzyme preparation from As- 51. Galagan JE, et al. (2005) Sequencing of Aspergillus nidulans and comparative analysis with pergillus niger. Journal of Bacteriology 91(6):2237–2244. A. fumigatus and A. oryzae. Nature 438(7071):1105–1115. 17. Tan QW, Gao FL, Wang FR, Chen QJ (2015) Anti-TMV activity of malformin A1, a cyclic penta- 87 6 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. A B

Tubingensis clade biseriates Niger clade Carbonarius clade Heteromorphus clade Uniseriates Other uniseriates other Emodin, Aculinic acid, Aculinic

DRAFTrelated compounds Fig. 1. Heatmap of shared SMGC families and gene clusters linked to compounds. This heatmap contains information on phylogeny of used strains, shared SMGC families and metabolite-linked gene clusters based on MIBiG entries. The row dendrogram represents a whole genome phylogeny of strains in the dataset. (A) Relative amounts of shared SMGC families between species in percent. Here, the presence of SMGC families resulting from our pipeline was compared through all species in the dataset. Percentage is indicated as color gradient in bins of 10% from grey cells (0-10%, not present in dataset) to red cells (90-100%) as shown by the color key. Additionally, a histogram indicates the abundance of different amounts of shared SMGC families, hence, how many comparisons result in low or high similarity respectively. Species self-comparison always results in values of 100%. The key shows that the majority of species to species comparisons indicates a 20-30 % similarity of SMGC families. Within clades, the values are typically above 50%. 80-100% is shared among isolates of the same species and closely related species. The column dendrogram represents a hierarchical clustering of organisms by shared SMGC percent, hence strains clustering together will share a high amount of SMGC. (B) Identification of compound-linked gene clusters based on MIBiG entries. Best hits for MIBiG entries,that suffice the cutoffs, were identified inside families using protein BLAST (red dot). Emodin and aculinic acid were identified manually by sequence identifier (blue dot). Using a guilt by association approach, the whole family of gene clusters is considered to be responsible for the production of a similar metabolite. The heatmap column dendrogram is clustered hierarchically based on presence of compound-linked gene clusters, hence, organisms with shared known SMGC will cluster together.

88 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 7 A. luchuensis CBS 106.47

A. luchuensis IFO 4308 A. piperis

A. eucalypticola

A. costaricaensis

A. neoniger

A. tubingensis

A. vadensis

A. niger CBS 513.88 A. lacticoffeatus

A. niger ATCC 13496 A. niger (A phoenicis ( A. niger ATCC 1015

A. niger NRRL3

A. welwitschiae

A. brasiliensis

A. sclerotiicarbonarius

A. homomorphus ● Major facilitator superfamily ● AMP−dependent synthetase/ligase−like 0 kb 20 kb ● Methyltransferase ● Aminotransferase ● Transcription factor ● 10 kb ● NRPS ● pyridine nucleotide−disulphide oxidoreductase ● NRPS−Like ● PyridineDRAFT nucleotide−disulphide oxidoreductase, ● Glutathione S−transferase FAD/NAD(P)−binding domain

Fig. 2. Predicted SMGC family for malformin producing gene clusters. Intepro annotations are indicated by color. The shown family contains NRPS encoding genes and tailoring enzymes which fulfilled our criteria. We hypothesize that genes coding for glutathione-S-transferases or pyridine-disulphide oxidoreductases are responsible for the disulphide bond in malformin (structure in Fig.4)

89 8 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. A 93 ●FQ 95 ●FR 97 ●MA (2) 71 ●FG A 538.2131 B 552.2298 ●PE ●FR 530.2465 ●MA(5) 516.2310 99 ●FG 533.2582 547.2734 83 ●FQ ●FQ 515 520 525 530 535540 530535540545550 555 ●MA (4) m/z m/z ●FG ●FG C ●FG Reference 88 ●MA (1) ●PE mlfAΔ, mlfA-Oex ●PE 88 ●FQ mlfAΔ, mlfA-Oex ●MA (3) ●FG mlfAΔ ●FG 8 9 10 11 Time (min) D Condensation domain type ● Cyclization ● Epimerization ● LL−joining

● DL−joining ● Inactive ● NA B Compound NRPS domains Fungisporin (FG) Aspergillus niger Fumiquinazoline (FQ) Aspergillus fumigatus Fumitremorgin (FR) Malformin A2 Malformin C Aspergillus fumigatus Fig. 4. Extracted Ion Chromatograms (EIC) for malformin overexpressing (mlfA∆, Penicillin (PE) mlfA-Oex) and malformin knock-out (mlfA∆) strains. A and B show MS spectra + + + + + + Penicillium chrysogenum of detected adducts [M H] , [M NH4] and [M Na] for the peaks displayed in C Malformin* showing merged EICs of the six adducts ( 0.005 Da) in the reference strain (akuA ± Aspergillus brasiliensis ∆ ::AFLpyrG), mlfA∆- mlfa-Oex (mlfA∆ IS1::PgdpA-mlfA) and mlfA deletion strain (mlfA∆). A reveals the peak at RT 8.9 min contains calc. m/z 516.2310, 533.2582, Fig. 3. Classification of condensation domains inside the predicted NRPS responsible 538.2131, corresponding to adducts of low-mass malformins, e.g. A2 and C, re- for Malformin synthesis. A: Approximate Maximum likelihood phylogeny of conden- spectively (D). The two peaks at RT 9.4-9.7 min contain the adducts of high-mass sation domain amino acid sequences. Sequences of condensationDRAFT domains with malformins, calc. m/z 530.2465, 547.2734, 552.2298 (B), where the largest peak known activities from fungisporin (FG), fumiquinazolines (FQ), fumitremorgin (FR) and at RT 9.7 min represents malformin C as determined by comparison to a reference penicillin (PE) were used to infer activities of condensation domains in the predicted standard of malformin C (D). The small peak at RT 9.4 min denotes another of malformin (MA) producing NRPS. The tree was generated from 60% of conserved the high-mass malformin (e.g. malformin A1, B1, B3, B4) (50). In C, the vertical aligned columns and bootstrapped 1000 times. Bootstrap values over 70 are shown axis displaying MS counts is not shown, however the intensity of the tallest peak is approximately 2 106. next to their node. The analysis shows distinct clusters corresponding to functions of × condensation domains supported by high bootstrap support. B: Schematic for used NRPS proteins. Condensation domains are highlighted according to their function as depicted in the legend (NA: not available). Adenylation and pcp domains are represented by white cells.

90 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 9 52. Pel HJ, et al. (2007) Genome sequencing and analysis of the versatile cell factory Aspergillus tical Genomics: Methods and Protocols, eds. Mathé E, Davis S. (Springer New York, New niger CBS 513.88. Nature biotechnology 25(2):221–31. York, NY), pp. 335–351. 53. Khaldi N, Collemare J, Lebrun MH, Wolfe KH (2008) Evidence for horizontal transfer of a 67. Katoh K, Misawa K, Kuma Ki, Miyata T (2002) MAFFT: a novel method for rapid multiple secondary metabolite gene cluster between fungi. Genome biology 9(1):R18. sequence alignment based on fast Fourier transform. Nucleic acids research 30(14):3059– 54. Nielsen ML, et al. (2011) A genome-wide polyketide synthase deletion library uncovers novel 3066. genetic links to polyketides and meroterpenoids in Aspergillus nidulans. FEMS Microbiol Lett 68. Castresana J (2000) Selection of Conserved Blocks from Multiple Alignments for Their Use 321(2):157–166. in Phylogenetic Analysis. Molecular Biology and Evolution 17(4):540–552. 55. Andersen MR, et al. (2011) Comparative genomics of citric-acid-producing Aspergillus niger 69. Stamatakis A (2014) RAxML version 8: A tool for phylogenetic analysis and post-analysis of ATCC 1015 versus enzyme-producing CBS 513.88. Genome Research 21(6):885–897. large phylogenies. Bioinformatics 30(9):1312–1313. 56. Nielsen JC, et al. (2017) Global analysis of biosynthetic gene clusters reveals vast potential 70. Jones P, et al. (2014) InterProScan 5: Genome-scale protein function classification. Bioinfor- of secondary metabolite production in Penicillium species. 17044(April). matics 30(9):1236–1240. 57. Rausch C, Hoof I, Weber T, Wohlleben W, Huson DH (2007) Phylogenetic analysis of conden- 71. Sievers F, et al. (2011) Fast, scalable generation of high-quality protein multiple sequence sation domains in NRPS sheds light on their functional evolution. BMC evolutionary biology alignments using Clustal Omega. Molecular systems biology 7(1):539. 7:78. 72. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T (2009) trimAl: A tool for automated align- 58. Bode HB, Bethe B, Höfs R, Zeeck A (2002) Big effects from small changes: possible ways to ment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972–1973. explore nature's chemical diversity. Chembiochem 3(7):619–627. 73. Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ (2015) IQ-TREE: A fast and effective 59. Charlop-Powers Z, et al. (2016) Urban park soil microbiomes are a rich reservoir of nat- stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and ural product biosynthetic diversity. Proceedings of the National Academy of Sciences Evolution 32(1):268–274. 113(51):201615581. 74. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS (2017) ModelFinder: 60. Ziemert N, et al. (2014) Diversity and evolution of secondary metabolism in the marine actino- Fast Model Selection for Accurate Phylogenetic Estimates. Nature Methods 14(6):587–591. mycete genus Salinispora. Proceedings of the National Academy of Sciences of the United 75. Minh BQ, Nguyen MAT, Von Haeseler A (2013) Ultrafast approximation for phylogenetic boot- States of America 111(12):1130–9. strap. Molecular Biology and Evolution 30(5):1188–1195. 61. Csardi G, Nepusz T (2006) The igraph software package for complex network research. In- 76. Varga J, et al. (2007) Aspergillus brasiliensis sp. nov., a biseriate black Aspergillus species terJournal Complex Sy:1695. with world-wide distribution. International Journal of Systematic and Evolutionary Microbiol- 62. Pons P, Latapy M (2005) Computing communities in large networks using random walks. ogy 57:1925–1932. Physics and Society p. arXiv:physics/0512106. 77. Frisvad JC, Samson RA (2004) Polyphasic taxonomy of Penicillium subgenus Penicillium: 63. R Core Team (2017) R: A Language and Environment for Statistical Computing. A guide to identification of food and air-borne terverticillate Penicillia and their mycotoxins. 64. Warnes GR, et al. (2016) gplots: Various R Programming Tools for Plotting Data. Studies in Mycology 2004(49):1–173. 65. Yu G, Smith D, Zhu H, Guan Y, Lam TTY (2017) ggtree: an R package for visualization and 78. Kildgaard S, et al. (2014) Accurate dereplication of bioactive secondary metabolites from annotation of phylogenetic trees with their covariates and other associated data. Methods in marine-derived fungi by UHPLC-DAD-QTOFMS and a MS/HRMS library. Marine Drugs Ecology and Evolution. 12(6):3681–3705. 66. Hahne F, Ivanek R (2016) Visualizing Genomic Data Using Gviz and Bioconductor in Statis-

DRAFT

91 10 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. 92 Chapter 4

Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli

The previous chapters were dealing with secondary metabolite gene clusters as a whole and focusing on creating families that produce similar compounds. In this chapter one enzyme class is being investigated more closely to find dynamics on the domain level between SM genes. Phylogenetic studies of secondary metabolite enzymes like polyketide synthases (PKS) and non- ribosomal peptide synthetases (NRPS) have elucidated their evolutionary dynamics and origins. Hybrids of these enzymes, PKS-NRPS and NRPS- PKS enzymes, however, have not been the focus of these studies — neglect- ing their evolution. In this study, we generated a phylogeny focusing on hybrids to investigate their diversity and dynamics in the genus Aspergillus. Furthermore, we relate hybrids to PKSs and NRPSs using their adenylation and ketoacylsynthase domains, since we hypothesize that hybrids can, as opposed to previous studies Gallo et al. (2013), be polyphyletic. Based on evidence that fungi have acquired hybrids from bacteria and fungi by lat- eral transfer(Lawrence et al., 2011; Khaldi et al., 2008), we blasted fungal hybrids against the NCBI non-redundant protein database to find further evidence for horizontal gene transfer. Our analysis indicates that hybrids are frequently transferred between fungi and one class of hybrids originated in bacteria. Our results highlight the dynamics of this enzyme class and will serve as guideline for recombination experiments to form new chimeras.

93 Theobald et al.

RESEARCH Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli

Sebastian Theobald*, Tammi C. Vesth and Mikael R. Andersen

Abstract Background: Filamentous fungi produce a vast amount of bioactive secondary metabolites (SMs) synthesized by e.g. hybrid polyketide synthase-nonribosomal peptide synthetase enzymes (PKS-NRPS; NRPS-PKS). While their domain structure suggests a common ancestor with other SM proteins, their evolutionary origin and dynamics in fungi are still unclear. Recent rational engineering approaches highlighted the possibility to reassemble hybrids into chimeras — suggesting molecular recombination as diversifying mechanism. Results: Phylogenetic analysis of hybrids in 39 strains lets us describe their dynamics throughout the genus Aspergillus. The tree topology indicates that three groups of PKS-NRPS hybrids as well as one group of NRPS-PKS developed independently from each other in Aspergilli. Comparison to other SM genes lead to the conclusion that hybrids in Aspergilli have several PKS ancestors; in contrast, hybrids are monophyletic when compared with NRPSs — with the exception of a small group of NRPSs. Our analysis also revealed that NRPS-likes are derived from NRPSs at many different events. An extended phylogenetic analysis including multiple bacterial and fungal taxa revealed multiple ancestors of hybrids. Homologous hybrids are present in all sections which suggests frequent horizontal gene transfer between genera and a finite number of hybrids in fungi. Conclusion: Phylogenetic distances between hybrids provide us with evidence for their evolution: Large inter-group distances indicate multiple independent events leading to hybrid development, while short intra-group distances of hybrids from different taxonomic sections indicate frequent horizontal gene transfer. Our results are further supported by adding bacterial and fungal genera. Presence of related hybrid genes in all Ascomycetes suggests a frequent horizontal gene transfer between genera and a finite diversity of hybrids — also explaining their scarcity. The provided insights into relations of hybrids and other SM genes will serve in rational design of new hybrid enzymes. Keywords: Aspergillus; PKS-NRPS hybrids; Secondary metabolites; Gene clusters

Background as e.g. the non-ribosomal peptides malformins [1]. The Secondary metabolites (SMs), non-growth associated enzymes classes producing these distinct compounds compounds, have been subject to research efforts due — polyketide synthases (PKSs) and non-ribosomal to their wide range of bioactivities. Polyketides like peptide synthetases (NRPSs) — can also be fused sterigmatocystin and aflatoxin, two potent mycotox- into a PKS-NRPS or NRPS-PKS hybrid, creating a ins, cause food spoilage; while others like the choles- chimeric compound. The products of hybrids, e.g. cy- terol lowering lovastatins can be used as medical drugs. clopiazonic acid, pyranonigrin and cytochalasin show Many SMs are promising leads for anti-cancer drugs a wide range of bioactivities [2–4]. *Correspondence: [email protected] Evolutionary events leading to new enzymes and Department of Biotechnology and Biomedicine, Technical University of hence compounds have been described in detail for Denmark, Anker Engelunds vej 1, Kgs. Lyngby, DK Full list of author information is available at the end of the article PKSs and NRPSs. Exchange of initiation modules for 94 Theobald et al. Page 2 of 14

modification of primer units, module duplication, hor- Terrei and P. chrysogenum to cover a wide range of izontal gene transfer and acquisition of tailoring en- Eurotiomycetes (Fig.1. In the phylogeny, NRPS-PKS zymes diversified polyketides [5–7]. Other studies sug- and PKS-NRPS hybrids form several distinct groups gest a burst of PKS duplications in the early Pezizomy- in the phylogeny, with the PKS-NRPS orientation be- cotina, a predeccor of mainly Ascomycota,[8] as ma- ing more abundant in Aspergilli (Fig.1). One group jor driver for PKS diversity. Similarly, phylogenomic including the cytochalasin-associated hybrid gene ccsA studies of NRPSs suggest a horizontal gene transfer only include few species and show large phylogenetic from bacteria to fungi [9]. suggest mono and bimod- distance to other hybrids, indicating that they are ular NRPSs to be most abundant in bacteria, while rare in Aspergilli. The second group includes hybrids euascomycetes show multimodular NRPSs in response which are conserved in biseriate Nigri species, but also to gain and loss of domains. A. homomorphus, A. clavatus A. campestris and A. In contrast, hybrids have been neglected by phylo- ochraceroseus. The tree topology indicates this as com- genetic studies, although their combination of NRPS mon hybrid duplication in Aspergilli through its con- and PKS domains suggests an interesting evolutionary servation in many species. The short phylogenetic dis- history. The existing studies only focused on known tance of A. ibericus and A. sclerotiicarbonarius hy- hybrids [10]. Lawrence et al [11] have shown that a brids to A. campestris and A. ochraceroseus hybrids single Cochliobolus heterostrophus NRPS-PKS hybrid is remarkable, since these species are from different gene originates from Burkholderiales; which they sug- sections. Additionally, it shows how hybrids conserved gest to have taken place in the early development of in one subgroup, the biseriates of section Nigri, can the Pezizomycotina. be found in A. homomorphus — a member of uniseri- A model for the avirulence factor ACE1 gene [12] ates. Furthermore, reaquisition of related hybrids ap- by Khaldi et al. [13] shows gene duplication, loss, and pears to be common since A. ibericus and A. steynii horizontal gene transfer — a common event between contain duplications with larger phylogenetic distances fungi [14] — as driver in diversification of ACE1 hy- (Aspibe1 443386 and Aspibe1 469268; Aspste1 454498 brids. With an extended dataset, we sought to identify and Aspste1 477231). the origins of hybrids in Aspergillus, providing insight The third group contains pseurotin A and isoflavipucine into molecular evolution — the basis of chemical inno- linked hybrids. The presence of these hybrids in two vation. Understanding hybrid evolution and diversity sister clades is surprising since they use different amino will put rational engineering of these proteins into our acids and produce distinct compounds. However, the grasp and pave the way for new bioactive compounds. broad substrate acceptance of isoflavipucine, which Thus, we sought to investigate the phylogenetic dy- has been shown to accept many different substrates namics of PKS-NRPS and NRPS-PKS in the SM rich to create 63 diverse compounds [17], supports a com- genus Aspergillus and relate them to other classes of mon origin. A fourth group contains hybrids related SM genes. Furthermore, we identify origins of hybrids to chaetoglobosin, aspyridone and cyclopiazonic acid in bacteria and other fungal genera. hybrids. It is surprising that chaetoglobosin and cy- tochalasin hybrids show exceptional distance in our Results analysis. This suggests that, although the hybrids are Genus wide analysis identifies independent groups of similar in their biosynthetic activity, they evolved in- hybrids dividually. Recent work has highlighted the dynamics of SM Following are four groups which are unrelated to genes in fungi. NRPS are subject to duplication and the other hybrids, two consisting of PKS-NRPS, in- less, frequent domain gain and loss, with fungi specific cluding pyranonigrin related hybrids, and two groups groups emerging [15, 16]; and PKSs show great diver- of NRPS-PKS orientation. Pyranonigrin related hy- sity in fungi as well [8]. Hence we were interested why brids are, with the exception of one hybrid from A. PKS-NRPS and NRPS-PKS hybrids are relatively few steynii, unique for section Nigri (at least in the scope in Aspergillus species compared to PKSs and NRPSs of our dataset). Notably, NRPS-PKS hybrids are rare and whether we find the evolutionary events which among the analyzed species and are only present in led to hybrid evolution. Due to apparent independent a few species: the biseriates of section Nigri, A. in- parts, we expect a fusion of NRPS and PKS occured dologenus, A. steynii, A. campestris, and Penicillium during early fungal evolution. chrysogenum. Their absence in other species and low In order to investigate this, we created a maximum number points towards recent acquisition by horizon- likelihood phylogeny of hybrid proteins from Aspergilli tal gene transfer (HGT) of all NRPS-PKS. The phy- of section Nigri (including biseriates and uniseriates), logeny indicates two major groups of NRPS-PKS hy- Circumdati, Candidi, Flavi, Fumigati, Ochracerosii, brids and two hybrids as outgroups (Aspcam1 323099 95 Theobald et al. Page 3 of 14

and Aspste1 418130). While one major group is biseri- as well. A. saccharolyticus, a uniseriate, contains one ate specific, the other group consists of P. chrysogenum hybrid from biseriate species and one homolog which (Pench1 85311), hybrids from biseriates, and A. in- shows high conservation to a hybrid from A. steynii dologenus. A. luchuensis and A. piperis are the only and A. oryzae (Fig.1). The latter is responsible for species that carry hybrids from both major groups, cyclopiazonic acid (CPA) synthesis, a mycotoxin [19]. pointing towards a HGT before their speciation, or In the dataset, we included only few non Nigri retention of a hybrid. The position of P. chryso- species with often only one representative per section. genum hybrid in the phylogeny points towards HGT Hence, figure1 will show that all hybrids in these as well. The NRPS-PKS oriented hybrids show very species have homologs in species of other sections than large distance to the rest of hybrids in the phyloge- their own. This is of course, not the case, the analysis netic tree, caused by the different orientation and bias however still gives a good indication of the origin of of the alignment towards the more prominent PKS- their hybrids. As in the case for the isoflavipucine hy- NRPS oriented hybrids. Hence, we created alignments brid from A. terreus and a hybrid from A. campestris. of single adenylation and ketosynthase domains which Here, a short phylogenetic distance indicates HGT be- would provide an unbiased view of the relation of tween these species rather than hybrid conservation. NRPS-PKS and PKS-NRPS hybrids. Although not derived by HGT, but still worth men- tioning are hybrids from A. steynii. Its hybrids are Genus wide analysis provides evidence for HGT related to almost every subgroup of hybrids in the Horizontal gene transfer of hybrids has been observed dataset. for fungi of different genera. With the ML phylogeny A. steynii and section Nigri species contain a large established, it was an obvious step to extend our anal- number of diverse hybrids which are related to most ysis for detection of potential horizontal gene transfer. subgroups in the dataset. If new lineages of hybrids If all hybrids were inherited vertically, we would expect would frequently emerge throughout sections we would for a given synthetase that the best homologs (near- expect more section specific hybrids and A. steynii est neighbour in the tree) originate from the same sec- as well as Nigri species would cover less of the hy- tion as the synthetase itself. To find the best homologs brid groups. This suggests that the evolutionary events of hybrids for each species, we extracted distances of leading to hybrid generation happened before species hybrids from the ML phylogeny and classified them diversification in the genus Aspergillus. Another ob- according to origin (Fig.1). This analysis works best servation is that closely related hybrids are present in with the hybrid rich section Nigri, since we cover most species here, but also the other Aspergillus species and many phylogenetically distant sections. This suspects P. chrysogenum provide insights into hybrid dynamics. that diversification of hybrids is happening through Our analysis reveals that biseriates of section Ni- recombination events after HGT of hybrids. Addition- gri contain mostly conserved groups of hybrids, but ally, we expect that NRPS-PKS hybrids were either de- can contain some hybrids derived from other parts of rived by joining of independent NRPS and PKS genes the phylogeny (Fig.1). A. sclerotiicarbonarius con- or acquired independently from another source, since tains a related hybrid to the aspyridone gene from they show large phylogenetic distance to PKS-NRPS A. nidulans and A. heteromorphus contains a major- hybrids. Since the phylogenetic distance could be bi- ity of uniseriate hybrid homologs. A. ibericus contains ased by the amount of structurally similar PKS-NRPS one third of hybrid homologs from other sections. A. hybrids we created further comparisons on basis of sin- sclerotioniger contains a uniseriate homolog (protein gle domains. There are however, intrinsically large dis- id 605326), located in a group together with hybrids tances between PKS-NRPS distances as well which we from A. heteromorphus, a biseriate, P. chrysogenum wanted to investigate further. and A. clavatus ccsA. Thus, we predict this hybrid to produce sclerotionigrin [18], a related compound to cy- PKS analysis shows common ancestors for PKSs and tochalasins produced by ccsA. It is suprising however hybrids that A. sclerotioniger 605326 shows a shorter phylo- Since intrinsically, hybrids did show large phylogenetic genetic distance to the A. heteromorphus hybrid. Hy- distances (Fig.1), we hypothesized their origin from brids have been shown previously to be subject to loss related SM genes. Previous studies prove hybrid parts and reaquisition. This conservation could be the re- as exchangeable, hence, we proposed that other SM sults of a reaquisition of a hybrid from uniseriates to genes could join together in filamentous fungi to form biseriates — the apparent absence of ccsA homologs a hybrids. In order to study this, we created a ML in section Nigri supports a deletion event. phylogeny of 1369 ketosynthase (KS) domains of PKS- Uniseriates of section Nigri, with species containing like, PKS and hybrids to elucidate their phylogenetic less hybrids than the biseriates, show foreign homologs relations (Fig.2). 96 Theobald et al. Page 4 of 14

The tree topology shows multiple groups consisting nr database to find homologs of Aspergillus hybrid of PKS only, mixed PKS and hybrids and PKS-likes. genes. Adenylation domains from 288 best hits were PKS-likes form two unrelated groups, suggesting that extracted and added to the Aspergillus hybrid adeny- they are largely unrelated to PKSs. Hybrids are sep- lation domain set. Subsequent alignment and ML anal- arated into two groups. NRPS-PKS hybrids are lo- ysis generated the phylogeny in Fig.6. cated as sister clade to 6-methylsalicylic acid (6-MSA) Best blast hits were mostly originating from As- PKS related genes — including PKSs for synthesis comycete classes: Dothideomycetes, Eurotiomycetes, of yanuthones, terreic acid and patulin (Fig.3 right Leotiomycetes, Sordariomycetes, Xylonomycetes, one panel). PKS-NRPSs are clustering together with other Orbilliomycete, and one Exobasidiomycete hybrid were PKSs that frequently break into hybrid clades and included. Hybrids from bacterial classes include Pro- separate known examples from each other (Fig.3 left teobacteria, Terrabacteria and Planctomycetes. We panel). Thus, we suggest these PKSs and PKS-NRPS find fungal sequences distributed throughout the tree, hybrids to have common ancestors in fungi. PKS linked and although many ascomycete taxa are included, the to citreoviridin and pyripyroene as sister clade to hy- tree topology indicates that hybrids are conserved brids (the pyriporene hybrid has and adjacent adenyla- throughout these taxa. Certainly, our blast search tion (A) domain in its cluster). Thus these PKSs could might bias the tree topology, nonetheless, if PKS and be the ideal precursor for the molecular evolution of NRPS would recombine frequently in fungi, we would hybrids. expect more intermediates. There are some NRPS, In summary, hybrids do not form a monophyletic mostly from Sordariamycetes and Dothideomycetes, clade inside the ML phylogeny. Hence we can hypoth- which are related to PKS-NRPS hybrids. These could esize that PKSs and hybrids had common ancestors — be remnants of ancestral NRPS which have been distinct ones for NRPS-PKS and PKS-NRPS genes. donors for hybrids in fungi. The majority of the tree consists of PKS-NRPS hybrids, while NRPS-PKS hy- Additionally, the analysis shows that NRPS-PKS and brids from fungi and bacterial NRPSs and hybrids PKS-NRPS hybrids are unrelated as indicated by the (from Terrabacteria and Proteobacteria), are coclus- phylogeny of hybrids (Fig.1). tering in one location (Fig.6). Inside the cluster, we can identify the thanamycin hybrid gene from Pseu- Phylogeny of NRPSs and hybrids reveals monophyletic domonas sp. SHC52, a lipopeptide. What’s more, clade of hybrids we can identify hybrids KPC78190.1, APD71785.1, Hybrids incorporate amino acids (e.g. tyrosine in case WP 023586037.1 from Streptomyces sp. in a sister of the cytotoxic aspyridone or L-phenylalanine in case clade to hybrids from multiple fungal genera. This of cytochalasins) into compounds in a manner similar indicates that lipopeptides could be a progenitor of to NRPS and NRPS-likes. Thus we sought to investi- NRPS-PKS in general and a hybrid from Streptomyces gate the phylogenetic relationship of these proteins. could be horizontally transferred to fungi — giving rise We created a maximum likelihood tree of 2428 to NRPS-PKS hybrids in fungi. adenylation domains from NRPS, NRPS-like and hy- The phylogeny also shows related hybrids of differ- brid proteins which led to mostly monophyletic groups ent sections in the same branch, as in the case for (Fig.4). NRPS-likes form two groups which are mono- genes of aspyridone and fumosorinone — two com- phyletic. Other groups comprise NRPS which appear pounds similar in structure [20]. This emphasizes that to have a common ancestor, with few NRPS-likes form- the structural diversity of hybrids throughout As- ing a sister clade. Hybrids form a monophyletic group, comycetes might be limited. PKS-NRPS homologs of they are however located in a sister clade with a group Aspergilli are co-clustering with many hybrids from of NRPS and NRPS-likes conserved in uniseriate Ni- Sordariamycetes and Eurotiomycetes. Thus, the dis- gri species (Fig.5, NRPS homologs are also found in tances of pyranonigrin associated hybrids observed A. heteromorphus and A. ellipticus). These proteins earlier (Fig.1) can now be explained with the added could possibly have a common ancestor. dataset. The A. ellipticus hybrid (460246) is cluster- ing closer together with Sordariamycetes; the same for Extended analysis of hybrids shows two events leading A. steynii which carries an Eurotiomycete related hy- to hybrid development brid. Recurrence of the same genera emphasizes that Since the comparison of A domains between NRPS, hybrids might be limited in fungi, which is why there NRPS-likes and hybrids did not deliver the same are usually so few. amount of ancestors we observed for PKS, we sought to extend our search space to the National Center Discussion for Biotechnology Information (NCBI) non-redundant Analysis of the whole SM protein repertoire of a genus protein database. We used protein blast on the NCBI wide dataset led us to discover dynamics between 97 Theobald et al. Page 5 of 14

NRPSs, NRPS-likes, PKSs, PKS-likes and NRPS-PKS SM related to cytochalasins from A. clavatus. Dupli- as well as PKS-NRPS hybrids. While previous studies cation, loss and reacquisition of related hybrids ap- included hybrids related to known examples [10], fo- pears to be common since A. ibericus and A. steynii cused on NRPS domains [9], or where considering sin- contain duplications with larger phylogenetic distances gle hybrids for analysis [11, 13] we combined analysis (Aspibe1 443386 and Aspibe1 469268; Aspste1 454498 of A and KS-domains through all hybrids in a genus and Aspste1 477231). This duplication, loss, and reac- wide species set and related them to other SM enzymes quisition pattern is similar to the one proposed for to focus on their evolutionary history. ace1. Ace1 is duplicated during Sordariamycete evo- Gallo et. al [10] also showed that the hybrid and lution to ace1 and syn2. In an HGT event, syn2 is NRPS adenylation domains are monophyletic — which transferred from Magnaporthe grisea to A. clavatus our analysis supports. Yet, we could identify a sister resulting in ccsA — an organism which lost the orig- clade of NRPS and NRPS-likes which groups together inal ace1 ancestor [13]. Additionally, we find several with hybrids. These genes are valuable leads in the cases where phylogenetic distances to hybrids of other investigation of SM gene evolution in the genus As- sections are shorter than to hybrids of their own sec- pergillus. Additionally, one common ancestor of PKS tion. For example, A. heteromorphus, although biseri- and Hybrid PKS-NRPS had been hypothesized [10]. ate, contains mostly hybrid homologs from from unis- Our results indicate multiple PKSs which could be the eriate species and A. saccharolyticus contains a ho- ancestor of hybrids. We can confirm that despite struc- molog to a hybrid from of A. flavus. This is further tural (or biosynthetic) similarity of cytochalasins and supported by our extended analysis using additional chaetoglobosisn, the genes for their synthetases have a distinct phylogenetic history — ccsA, the cytochalasin hybrids from other fungi and bacteria. A. ellipticus hybrid, is more closely related to the equisetin hybrid and A. steynii show homologs to Sordariomycetes and than to the chaetoglobosin hybrid. Eurotiomycetes, respectively — indicating a loss and Previous studies indicated that NRPS-PKS are of reaquisition of a pyranonigrin related hybrid. The phy- bacterial origin, while PKS-NRPS due to their abun- logenetic relationship of hybrids points to few events dance in fungi are of fungal origin [11]. Our analysis which yielded novel hybrids, the phylogenetic distances shows that only A and KS domains of NRPS-PKS however, exclude a loss only diversification. We suggest hybrids have similar phylogenetic histories in fungi that hybrids are frequently transferred, hence the short (as indicated in [11]). The extended comparison to distances betweeen hybrids of distantly related species hits from the NCBI non redundant protein database and the finite number of clades/ their low amount. in this study revealed that the collective of NRPS- Aspergilli show vast amounts of SM genes; hybrids PKS in Ascomycetes is related to bacterial hybrids however, are the minority (Table1). These low counts, and lipopeptide synthetases, suggesting bacterial ori- their dynamics and their unrelated groups1 pose the gin. PKS-NRPS hybrids on the other hand, show sim- question whether independent events lead to their evo- ilarity to different fungal PKSs in the ML phylogeny. lution and whether their diversity is finite. Hence, we suspect that A and KS domains have a dif- Hybrids were subject to many engineering efforts ferent phylogenetic history. Their monophyly in A do- since their structure suggested inpendendent units, main comparisons suggests that one NRPS ancestor which could be recombined. TenS and DmbS, part of was able to recombine with different PKSs. Hence our analysis shows that multiple events rather than one the hybrids producing the similar compounds tenellin event gave rise to hybrid evolution. and desmethylbassianin, respectively, could be fused According to previous studies, the ccsA ancestor together and resulted in the production of tenellin A was duplicated in Ascomycete development, lost in [21]. Nielsen et. al [22] created a fusion of CcsA and Aspergilli, and reacquired by A. clavatus from Mag- Syn2 resulting in niduporthin, a novel chimeric com- naporthe grisea [13]. In our analysis of SM genes pound. Despite these successful recombinations, creat- throughout 39 Aspergillus strains — with addition ing a chimera of hybrids remained challenging as other of best hits from the NCBI non redundant protein studies showed a limitation to recombination efforts database — we can further provide evidence for HGT [23]. Our study will facilitate further efforts to engineer of hybrids through ascomycetes. Biseriate species A. hybrids since we could show their dynamics inside the sclerotioniger, A. heteromorphus as well as uniseri- genus Aspergillus and relate it to other fungal genera. ate species A. saccharolyticus, all of which belong to Especially NRPS and PKS which have been identified section Nigri, contain conserved ccsA homologs (Fig. in this study as hybrid related might be amenable to 1). A. sclerotioniger is producer of sclerotionigrin, a modification. 98 Theobald et al. Page 6 of 14

Methods References Data collection 1. Wang, J., Jiang, Z., Lam, W., Gullen, E.A., Yu, Z., Wei, Y., Wang, L., Zeiss, C., Beck, A., Cheng, E.C., Wu, C., Cheng, Y.C., Zhang, Y.: Protein sequences and smurf annotations for As- Study of malformin C, a fungal source cyclic pentapeptide, as an pergillus and Penicillium species were downloaded anti-cancer drug. PLoS ONE 10(11), 1–19 (2015). http://genome.jgi.doe.gov/ doi:10.1371/journal.pone.0140069 from JGI ( ). 2. Awakawa, T., Yang, X.L., Wakimoto, T., Abe, I.: Pyranonigrin E: A PKS-NRPS hybrid metabolite from aspergillus niger identified by Genetic dereplication genome mining. ChemBioChem 14(16), 2095–2099 (2013). doi:10.1002/cbic.201300430 Secondary metabolite proteins were annotated with 3. Tokuoka, M., Seshime, Y., Fujii, I., Kitamoto, K., Takahashi, T., known compounds through BLAST against examples Koyama, Y.: Identification of a novel polyketide synthase-nonribosomal peptide synthetase (PKS-NRPS) gene required for the biosynthesis of from the Minimum information on biosynthetic gene cyclopiazonic acid in Aspergillus oryzae. Fungal Genetics and Biology clusters (MIBiG) database [24]. 45(12), 1608–1615 (2008). doi:10.1016/j.fgb.2008.09.006 4. Casella, J.F., Flanagan, M.D., Lin, S.: Cytochalasin D inhibits actin polymerization and induces depolymerization of actin filaments formed Identifying best hits in the NCBI non redundant protein during platelet shape change. Nature 293(5830), 302–305 (1981). database doi:10.1038/293302a0 5. McDaniel, R., Ebert-Khosla, S., Hopwood, D.a., Khosla, C.: Rational Adenylation domains of hybrids were blasted against design of aromatic polyketide natural products by recombinant the NCBI non redundant protein database (down- assembly of enzymatic subunits. (1995). doi:10.1038/375549a0 loaded from ftp://ftp.ncbi.nlm.nih.gov/blast/ 6. Ridley, C.P., Lee, H.Y., Khosla, C.: Evolution of polyketide synthases in bacteria. Proceedings of the National Academy of Sciences of the db/) using protein basic local alignment search tool United States of America 105(12), 4595–600 (2008). (BLAST) [25]. The best 15 hits which suffice a query doi:10.1073/pnas.0710107105 coverage cutoff of 80% were retained. Taxonomic labels 7. Throckmorton, K., Wiemann, P., Keller, N.P.: Evolution of Chemical Diversity in a Group of Non-reduced Polyketide Gene Clusters: Using for best hits were added using ETE 3 toolkit [26]. Phylogenetics to Inform the Search for Novel Fungal Natural Products vol. 7, pp. 3572–3607 (2015). doi:10.3390/toxins7093572 8. Koczyk, G., Dawidziuk, A., Popiel, D.: The distant siblings - A Protein domains, alignment and maximum liklihood phylogenomic roadmap illuminates the origins of extant diversity in analysis fungal aromatic polyketide biosynthesis. Genome Biology and InterproScan5 (version 5.22-61.0) [27] was used to Evolution 7(11), 3132–3154 (2015). doi:10.1093/gbe/evv204 9. Bushley, K.E., Turgeon, B.G.: Phylogenomics reveals subfamilies of identify domains on protein sequences. Domains of the fungal nonribosomal peptide synthetases and their evolutionary same type which were under 350 amino acids long and relationships. BMC evolutionary biology 10(1), 26 (2010). doi:10.1186/1471-2148-10-26 less than 100 amino acids apart were merged before 10. Gallo, A., Ferrara, M., Perrone, G.: Phylogenetic study of polyketide proceeding. Sequences were handled using Biopython synthases and nonribosomal peptide synthetases involved in the [28]. Resulting domain sequences were aligned using biosynthesis of mycotoxins. Toxins 5(4), 717–742 (2013). doi:10.3390/toxins5040717 Clustal Omega [29] and cut using trimal [30]. In case 11. Lawrence, D.P., Kroken, S., Pryor, B.M., Arnold, A.E.: Interkingdom of the full hybrid protein tree, full protein sequences gene transfer of a hybrid NPS/PKS from bacteria to filamentous were aligned and trimmed. IQ-tree [31] was used on Ascomycota. PloS one 6(11), 28231 (2011). doi:10.1371/journal.pone.0028231 aligned sequences using Model Finder Plus [32] and 12. B¨ohnert,H.U., Fudal, I., Dioh, W., Tharreau, D., Notteghem, J.-L., 1000 times ultra fast bootstrap [33]. Lebrun, M.-H.: A putative polyketide synthase/peptide synthetase from Magnaporthe grisea signals pathogen attack to resistant rice. The Plant cell 16(9), 2499–513 (2004). doi:10.1105/tpc.104.022715 Finding best homologs in trees 13. Khaldi, N., Collemare, J., Lebrun, M.-H., Wolfe, K.H.: Evidence for A distance matrix was extracted from the ML tree horizontal transfer of a secondary metabolite gene cluster between fungi. Genome biology 9(1), 18 (2008). doi:10.1186/gb-2008-9-1-r18 using the cophenetic function of the ape package [34] 14. Fitzpatrick, D.a.: Horizontal gene transfer in fungi. FEMS Microbiology in R [35]. Best homologs were plotted using ggplot2 Letters 329(1), 1–8 (2012). doi:10.1111/j.1574-6968.2011.02465.x [36] 15. Bushley, K.E., Ripoll, D.R., Turgeon, B.G.: Module evolution and substrate specificity of fungal nonribosomal peptide synthetases involved in siderophore biosynthesis. BMC evolutionary biology 8(1), Visualization 328 (2008). doi:10.1186/1471-2148-8-328 16. Bushley, K.E., Turgeon, B.G.: Phylogenomics reveals subfamilies of Phylogenetic trees were visualized using ggtree [37]. fungal nonribosomal peptide synthetases and their evolutionary ggstance [38]. matplotlib [39]. Gene clusters were plot- relationships. BMC evolutionary biology 10, 26 (2010). ted using Easyfig [40]. doi:10.1186/1471-2148-10-26 17. Qiao, K., Zhou, H., Xu, W., Zhang, W., Garg, N., Tang, Y.: A fungal nonribosomal peptide synthetase module that can synthesize thiopyrazines. Organic Letters 13(7), 1758–1761 (2011). Competing interests doi:10.1021/ol200288w The authors declare that they have no competing interests. 18. Petersen, L.M., Bladt, T.T., D¨urr,C., Seiffert, M., Frisvad, J.C., Gotfredsen, C.H., Larsen, T.O.: Isolation, structural analyses and Acknowledgements biological activity assays against chronic lymphocytic leukemia of two ST, TCV, and MRA acknowledge funding from The Villum Foundation, novel cytochalasins - Sclerotionigrin A and B. Molecules 19(7), grant VKR023437 99 Theobald et al. Page 7 of 14

9786–9797 (2014). doi:10.3390/molecules19079786 computational molecular biology and bioinformatics. Bioinformatics 19. Sorenson, W.G., Tucker, J.D., Simpson, J.P.: Mutagenicity of the 25(11), 1422–1423 (2009). doi:10.1093/bioinformatics/btp163. tetramic mycotoxin cyclopiazonic acid. Applied and Environmental arXiv:1011.1669v3 Microbiology 47(6), 1355–1357 (1984) 29. Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., 20. Liu, L., Zhang, J., Chen, C., Teng, J., Wang, C., Luo, D.: Structure Lopez, R., McWilliam, H., Remmert, M., S¨oding,J., Thompson, J.D., and biosynthesis of fumosorinone, a new protein tyrosine phosphatase Higgins, D.G.: Fast, scalable generation of high-quality protein multiple 1B inhibitor firstly isolated from the entomogenous fungus Isaria sequence alignments using Clustal Omega. Molecular systems biology fumosorosea. Fungal Genetics and Biology 81, 191–200 (2015). 7(1), 539 (2011). doi:10.1038/msb.2011.75. 0-387-31073-8 doi:10.1016/j.fgb.2015.03.009 30. Capella-Guti´errez, S., Silla-Mart´ınez,J.M., Gabald´on,T.: trimAl: A 21. Heneghan, M.N., Yakasai, A.A., Williams, K., Kadir, K.A., Wasil, Z., tool for automated alignment trimming in large-scale phylogenetic Bakeer, W., Fisch, K.M., Bailey, A.M., Simpson, T.J., Cox, R.J., analyses. Bioinformatics 25(15), 1972–1973 (2009). Lazarus, C.M.: The programming role of trans-acting enoyl reductases doi:10.1093/bioinformatics/btp348 during the biosynthesis of highly reduced fungal polyketides. Chemical 31. Nguyen, L.T., Schmidt, H.A., Von Haeseler, A., Minh, B.Q.: Science 2(5), 972 (2011). doi:10.1039/c1sc00023c IQ-TREE: A fast and effective stochastic algorithm for estimating 22. Nielsen, M.L., Isbrandt, T., Petersen, L.M., Mortensen, U.H., maximum-likelihood phylogenies. Molecular Biology and Evolution Andersen, M.R., Hoof, J.B., Larsen, T.O.: Linker flexibility facilitates 32(1), 268–274 (2015). doi:10.1093/molbev/msu300 module exchange in fungal hybrid PKS-NRPS engineering. PLoS ONE 32. Kalyaanamoorthy, S., Minh, B.Q., Wong, T.K.F., von Haeseler, A., 11(8), 1–18 (2016). doi:10.1371/journal.pone.0161199 Jermiin, L.S.: ModelFinder: Fast Model Selection for Accurate 23. Boettger, D., Bergmann, H., Kuehn, B., Shelest, E., Hertweck, C.: Phylogenetic Estimates. Nature Methods 14(6), 587–591 (2017). Evolutionary Imprint of Catalytic Domains in Fungal PKS-NRPS doi:10.1038/nmeth.4285 Hybrids. ChemBioChem 13(16), 2363–2373 (2012). 33. Minh, B.Q., Nguyen, M.A.T., Von Haeseler, A.: Ultrafast doi:10.1002/cbic.201200449 approximation for phylogenetic bootstrap. Molecular Biology and 24. Medema, M.H., Kottmann, R., Yilmaz, P., Cummings, M., Biggins, Evolution 30(5), 1188–1195 (2013). doi:10.1093/molbev/mst024 J.B., Blin, K., De Bruijn, I., Chooi, Y.H., Claesen, J., Coates, R.C., 34. Paradis, E., Claude, J., Strimmer, K.: APE: Analyses of Phylogenetics Cruz-Morales, P., Duddela, S., D¨usterhus,S., Edwards, D.J., Fewer, and Evolution in R language. Bioinformatics (Oxford, England) 20(2), D.P., Garg, N., Geiger, C., Gomez-Escribano, J.P., Greule, A., 289–90 (2004) Hadjithomas, M., Haines, A.S., Helfrich, E.J.N., Hillwig, M.L., Ishida, 35. R Core Team: R: A Language and Environment for Statistical K., Jones, A.C., Jones, C.S., Jungmann, K., Kegler, C., Kim, H.U., Computing. R Foundation for Statistical Computing, Vienna, Austria K¨otter,P., Krug, D., Masschelein, J., Melnik, A.V., Mantovani, S.M., (2017). R Foundation for Statistical Computing. Monroe, E.A., Moore, M., Moss, N., N¨utzmann,H.W., Pan, G., Pati, https://www.r-project.org/ A., Petras, D., Reen, F.J., Rosconi, F., Rui, Z., Tian, Z., Tobias, N.J., 36. Wickham, H.: ggplot2: Elegant Graphics for Data Analysis. Springer, Tsunematsu, Y., Wiemann, P., Wyckoff, E., Yan, X., Yim, G., Yu, F., ??? (2009). http://ggplot2.org Xie, Y., Aigle, B., Apel, A.K., Balibar, C.J., Balskus, E.P., 37. Yu, G., Smith, D., Zhu, H., Guan, Y., Lam, T.T.-Y.: ggtree: an R Barona-G´omez,F., Bechthold, A., Bode, H.B., Borriss, R., Brady, S.F., package for visualization and annotation of phylogenetic trees with Brakhage, A.A., Caffrey, P., Cheng, Y.Q., Clardy, J., Cox, R.J., De their covariates and other associated data. Methods in Ecology and Mot, R., Donadio, S., Donia, M.S., Van Der Donk, W.A., Dorrestein, Evolution (2017). doi:10.1111/2041-210X.12628 P.C., Doyle, S., Driessen, A.J.M., Ehling-Schulz, M., Entian, K.D., 38. Henry, L., Wickham, H., Chang, W.: ggstance: Horizontal ’ggplot2’ Fischbach, M.A., Gerwick, L., Gerwick, W.H., Gross, H., Gust, B., Components (2016) Hertweck, C., H¨ofte,M., Jensen, S.E., Ju, J., Katz, L., Kaysser, L., 39. Hunter, J.D.: Matplotlib: A 2D graphics environment. Computing In Klassen, J.L., Keller, N.P., Kormanec, J., Kuipers, O.P., Kuzuyama, Science & Engineering 9(3), 90–95 (2007). T., Kyrpides, N.C., Kwon, H.J., Lautru, S., Lavigne, R., Lee, C.Y., doi:10.1109/MCSE.2007.55 Linquan, B., Liu, X., Liu, W., Luzhetskyy, A., Mahmud, T., Mast, Y., 40. Sullivan, M.J., Petty, N.K., Beatson, S.A.: Easyfig: A genome M´endez,C., Mets¨a-Ketel¨a,M., Micklefield, J., Mitchell, D.A., Moore, comparison visualizer. Bioinformatics 27(7), 1009–1010 (2011). B.S., Moreira, L.M., M¨uller,R., Neilan, B.A., Nett, M., Nielsen, J., doi:10.1093/bioinformatics/btr039 O’Gara, F., Oikawa, H., Osbourn, A., Osburne, M.S., Ostash, B., Payne, S.M., Pernodet, J.L., Petricek, M., Piel, J., Ploux, O., Figures Raaijmakers, J.M., Salas, J.A., Schmitt, E.K., Scott, B., Seipke, R.F., Tables Shen, B., Sherman, D.H., Sivonen, K., Smanski, M.J., Sosio, M., Stegmann, E., S¨ussmuth,R.D., Tahlan, K., Thomas, C.M., Tang, Y., Truman, A.W., Viaud, M., Walton, J.D., Walsh, C.T., Weber, T., Van Wezel, G.P., Wilkinson, B., Willey, J.M., Wohlleben, W., Wright, G.D., Ziemert, N., Zhang, C., Zotchev, S.B., Breitling, R., Takano, E., Gl¨ockner,F.O.: Minimum Information about a Biosynthetic Gene cluster. Nature Chemical Biology 11(9), 625–631 (2015). doi:10.1038/nchembio.1890 25. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L.: BLAST+: architecture and applications. BMC Bioinformatics 10(1), 421 (2009). doi:10.1186/1471-2105-10-421 26. Huerta-Cepas, J., Serra, F., Bork, P.: ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution 33(6), 1635–1638 (2016). doi:10.1093/molbev/msw046 27. Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A.F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S.Y., Lopez, R., Hunter, S.: InterProScan 5: Genome-scale protein function classification. Bioinformatics 30(9), 1236–1240 (2014). doi:10.1093/bioinformatics/btu031 28. Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., De Hoon, M.J.L.: Biopython: Freely available Python tools for

100 Theobald et al. Page 8 of 14

Table 1 Number of hybrids and total SM proteins in Aspergillus species and Penicillium chrysogenum. SM proteins comprise PKS, PKS-like, NRPS, NRPS-like, Hybrids, dimethylallyltryptophan synthase and terpene cyclases. Complete name Jgi name Section/Group hybrids Total SM proteins A. campestris Aspcam1 Candidi 5 48 A. steynii Aspste1 Circumdati 10 98 A. clavatus Aspcl1 Clavati 4 43 A. flavus Aspfl1 Flavi 3 72 A. oryzae Aspor1 Flavi 2 68 A. fumigatus A1163 Aspfu A1163 1 Fumigati 2 36 A. fumigatus Af293 Aspfu1 Fumigati 1 37 A. novofumigatus Aspnov1 Fumigati 4 59 A. nidulans Aspnid1 Nidulantes 1 63 A. brasiliensis Aspbr1 Niger biseriates 5 82 A. carbonarius Aspca3 Niger biseriates 5 65 A. costaricaensis Aspcos1 Niger biseriates 5 97 A. ellipticus Aspell1 Niger biseriates 2 72 A. eucalypticola Aspeuc1 Niger biseriates 6 78 A. heteromorphus Asphet1 Niger biseriates 3 58 A. ibericus Aspibe1 Niger biseriates 6 57 A. luchuensis CBS 106.47 Aspfo1 Niger biseriates 6 93 A. luchuensis IFO 4308 Aspka1 1 Niger biseriates 6 88 A. neoniger Aspneo1 Niger biseriates 5 82 A. niger ATCC 1015 Aspni7 Niger biseriates 9 86 A. phoenicis Aspph1 Niger biseriates 9 89 A. piperis Asppip1 Niger biseriates 6 86 A. sclerotiicarbonarius Aspscle1 Niger biseriates 6 82 A. sclerotioniger Aspscl1 Niger biseriates 5 75 A. tubingensis Asptu1 Niger biseriates 7 93 A. vadensis Aspvad1 Niger biseriates 6 85 A. welwitschiae Aspwel1 Niger biseriates 7 87 A. aculeatinus Aspacu1 Niger uniseriates 3 90 A. aculeatus Aspac1 Niger uniseriates 4 73 A. brunneoviolaceus Aspbru1 Niger uniseriates 2 85 A. fijiensis Aspfij1 Niger uniseriates 4 94 A. homomorphus Asphom1 Niger uniseriates 3 78 A. indologenus Aspind1 Niger uniseriates 6 92 A. saccharolyticus Aspsac1 Niger uniseriates 2 52 A. uvarum Aspuva1 Niger uniseriates 3 76 A. violaceofuscus Aspvio1 Niger uniseriates 5 79 A. ochraceoroseus Aspoch1 Ochraceorosei 2 30 P. chrysogenum Pench1 Penicillium 2 50 A. terreus Aspte1 Terrei 1 73

101 Theobald et al. Page 9 of 14

Section/ subgroup Candidi Circumdati Clavati Flavi Fumigati Nidulantes Pyranonigrin E Niger_biseriates

Niger_uniseriates ● ● ● Ochraceorosei ● ● ● Penicillium ● ● ● Terrei ● 90 ● ● E ● ● Orientation ● ● ● ● N−type ● P−type

100% A A. flavus 36768

89 (CPA hybrid) 63% 94 Cyclopiazonic acid A. saccharolyticus 388526

A. steynii 88 D 512973 86 Chaetoglobosin B A. campestris Aspyridone A. steynii A. clavatus A. oryzae A. flavus A. novofumigatus A. fumigatus Af293 A. fumigatus A1163 A. nidulans A. welwitschiae A. vadensis A. tubingensis C A. sclerotioniger A. sclerotiicarbonarius A. piperis A. phoenicis Pseurotin A A. niger ATCC 1015 A. neoniger A. luchuensis IFO 4308 Isoflavipucine A. luchuensis CBS 106.47 A. ibericus A. heteromorphus A. eucalypticola A. ellipticus A. costaricaensis A. carbonarius A. brasiliensis A. violaceofuscus A. uvarum A. saccharolyticus A. indologenus A. homomorphus A. fijiensis 64 A. brunneoviolaceus A. aculeatus A. aculeatinus A. ochraceoroseus B 86 P. chrysogenum 85 A. terreus 0 1 2 3 4 5 6 7 8 9 10

57 100% 89 C A. heteromorphus 337005

64%

A. saccharolyticus 376297 A 82 92 Cytochalasin A. sclerotioniger 605326

A. clavatus 6366 (ccsA) 0.1

Figure 1 Hybrid dynamics throughout Aspergilli ML phylogeny of PKS-NRPS and NRPS-PKS hybrid proteins was created on aligned and trimmed protein sequences. Red letters indicate subtrees shown in Fig. S1. Sections and species groups indicated by tip color; Orientation of hybrids N-type (NRPS-PKS) and P-type (PKS-NRPS) indicated by tip shape. (A) Synteny plots of cyclopiazonic acid related hybrids. (B) Classification of nearest neighbors of ML phylogeny. A matrix of tip distances was extracted from the tree and nearest neighbors classified according to their section. (C) Synteny plot of cytochalasin related hybrids (ccsA in A. clavatus). A. sclerotioniger is known to produce sclerotionigrin — a cytochalasan

102 Theobald et al. Page 10 of 14

●● ●●●●● ●●● ●●●● Asperfuranone SM type ●● ●●● ● ●● ●● Azanigerone ●●●● ● ●●● Sorbicillin HYBRID ●●●● ●●● ●●●● ●●●● ●● ● ●●● PKS ●●● ●● ●●●● Terretonin ●● ●●● Austinol/dehydroaustinol, locus 1 ●●●●● ● ●●●● PKS−Like ●●●●●● ●●● Cichorine ● ● ● ●●●● ●● ●● ●●● ● ● Naphthopyrone ●● ●●● ●● ● ●●● ● ● Aflatoxin/sterigmatocystin ●●● ● ●●● ●●● ● ● Emericellin TAN−1612 Trypacidin ●●●● ●●● ●●● Alternariol Aflavarin Asperthecin ● ●● Terrein ● ● ●●● ●● ●● Endocrocin ● ●● ● F9775 ●●●● ● ●●●● ● ●● ● ●●● ●● ● ● ● ●● ●●● ●●●● ●● ● ● ●● ● ●●● ●●●● ● ● ● ●● ●● ●●● ● ● ●●● ●● ●●●● ● ●●●● Patulin Terreic acid Yanuthone D ● ● ●●● ●● ●●● ●● ●●●● ●●●● ● ●● ●●●● ●● ● ● ●● ●●● ●●●● ● ●●●●● ●● ● ● ●● ●●●● ●● ●● ●● ●●● ●● ● ●●● ● ● ●●● ●● ● ● ●● ● ●● ●●●●● ●●● ●● ●●●● ● ● ●●●● ●●● ● ●●●● ●● ● ● ● ● ● ● ●●● ●●●● ●●●● ●●●●● ●● ●● ●● ● ● ● ●●● ●●● ●●●● ● ● ●● ● ●●●● ● ●● ● ●● ●● ● ●●● ● ● ●●● ●● ● ●●●● ● Fumagillin ●● ●● ● ● Cytochalasin ● ● ●● ●●● ●● ●●● ●● ● ●●●● ● ●●● ●●●● ●● ●● ● ●● ●●● ●●● ●●● ●●●●● ●● ●●● ●●● Pyripyropene ●●● Citreoviridin ● ●●●● ●● Pseurotin A ●●● Isoflavipucine ●●● ● ● ●●●● ●●●● Chaetoglobosin ●●●●● Aspyridone ● ● ●● ●●●● ● Cyclopiazonic acid ● ●●● ● ● ●●● ● Pyranonigrin E ●●●● ●●●● ● ●●● ●● ● ●●● ●●●●● Lovastatin ●● ●● ●●●●● ● ●● ●●● ●●●●● ●● ●● ●●● ● ● ●● ●● ●●●● ● ● ●●●● ● ●●● ●●● ● ●●● ● ●●●●● ● ●●● ●●● ●●●● ● ●●●● ● ● ● ●● ● ●●● ● ●● ● ●●●●● ●●● ●●●● ●●● ●● ●●● ● ●●●● ●●●● ●●●●● ● ●●● ●●● ●●● ● ● ● ● 0.5 ●●● ● ●●●● ●● ● ●●●●

0.0 2.5 5.0 7.5 10.0

Figure 2 Phylogeny of PKS, PKS-like and hybrid proteins. The maximum likelihood phylogeny was created from KS domains of PKS, PKS-like and hybrid proteins. Tip color shows SM gene type (red: hybrid, green: PKS, blue: PKS-like). Hybrids linked to compounds are labelled with the compound name. Grey boxes highlight groups containing hybrids.

103 Theobald et al. Page 11 of 14

A B

● ● ● ● ●● ● ● 64 ●● ● 61 ● ● 18 ● 65 ● ●● Cytochalasin ● ●● 21 ●● ● ● ● 62 ●● ● ● 66 ● ● ● ● ● ● ●● ●● ● ● ● ●● 88 ● SM type ●● SM type ● ●● ●● ● ● ● HYBRID ●● HYBRID ● ● ● 71 ● ● ● ● ● ● PKS ● PKS ●● ● ● ● ● PKS−Like ●● ● PKS−Like Patulin ●● ● ● 65 ●● ● ●● Terreic acid ●● ● ● ● ●● ●● ● ● ●● 74 ● ●● ●● ● 74 ●● ● ● ● ● ● ● ●● ● 69 ●● ● ● ● Yanuthone D ●● 67 ●● 94 94 ● ● ● ●● ●● Pyripyropene ●● ● ● ● ● Citreoviridin ● ● ●● ● ● ● Pseurotin A ●● ●Isoflavipucine● ● ●● ●● ● ● ● ●● ● ● ● 92 ● ● ● ● ● ● 43 ● ● 73 ●● ● ● ●● ● ●● ● ●● Chaetoglobosin ● ● ●● Aspyridone ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● Cyclopiazonic acid ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● 93 ● 91 ● Pyranonigrin E 91 ● ● ● 79 ● ● ● ●● ● ● ●●

2.0 2.5 3.0 3.5 2.5 3.0 3.5

Figure 3 Phylogeny of PKS, PKS-like and hybrid proteins Left: PKS-NRPS containing subtree; Right: NRPS-PKS containing subtree. The maximum likelihood phylogeny was created from KS domains of PKS, PKS-like and hybrid proteins. Tip color shows SM gene type (red: hybrid, green: PKS, blue: PKS-like). Tip labels show associated compound, if applicable. Bootstrap values under 95 are shown in red. Hybrids linked to compounds are labelled with the compound name.

104 Theobald et al. Page 12 of 14

● ● ●● ●● ● ●●● ● ●● ● ●● ● ●●● ●●●● ●●●●●●●●●●● ● ●● ●●●●●●● ● ●● ●●●●●●●● ● ● ●●●●●●● ● ● ●●●●●●●● ●● ●●●●● ●●●●● ●●●●●●● ●●● ●●●● ●●●●● ●● ● ●● ●●● ●● Fumiquinazolines, NA ●●●● ●●●●● ●●●●●●●●● ● ●● ●●●●●● NA, Fumitremorgin ●●●●●●● Fumiquinazolines, NA ● ●●●● ●●● ●● ● ●●● Roquefortine C / meleagrin ●●● ●● ●●●● ●●●●●●● ● Roquefortine C / meleagrin ●● ●●● ● ●●●● ●●●●●●● ●●●●●● ● ● ●●●●●● ● ●●● ●● ●●●●●●●● ●●●●●● ●● ● ●●●●● ● ●●●●●● ● ● ●●● ●●●●●●● ●●●●● Ferrichrome ●●●●● ●● ●● ●●●●● ●●●●●●●●● ●●●●● Fungisporin ●●●●●● ● ●●●● ●●● ●●●● Fungisporin ●●●●●● ●●●●●●●●● ●● ●● ●● Fellutamide B ●●●● ●● ● ●●● ●●●●Fungisporin● ●● ●● ●●●●●●● ●●●●●● ●● ● ● ●● ● ●●● ● Fungisporin ●●●●● ●●●●●●●●●● ● ● ●●●● ●●●●●● ●●●● ●●●● ●● ● ●● ●●●●● ● ● ●●●● ● ● ●●●●● ● ●● ● ●●● ● ●● ●●● ●●●●●●● ●●●●●●●● ● ●●●●●●● ● ●●● Emericellamide A / emericellamide B ●● ●●●●● ● ● ●●●●● ● ●● ●●●●● ●●●●●●● ● ●●●●● ●●●●● ●●● ●● ● ●●●●●●●● ●●●●● ●● Fumitremorgin ●● ●●●●● SM type ●●●●●●●●●●● ●●●●●●●● ● ●●●●● ● ● ●●●●●●● HYBRID ●●●●●●●●● ● ●●●●● ● ● ●●●●●●●●●● ● ●● ●●●● Fumiquinazolines ● ● ●●●●●● NRPS ●●●●● ● ●●● ● ●●●●●● ●●●●● ● ● ●●●● ● NRPS−Like ●●●●● ●● ●●● ●●●●●●● ●●● ●●●● ●●●●● ● ●● ●●●● ● NRPS−Like, NRPS ●●●●● ●● ●●● ●●● ●● ● ●● ●●●●●●● ●● ●● ●● ●● ●●●●●●●● ●●●●● ●●●● ● ●●●● ● ●●●● ●●● ●●●● ●● ● ●●● ● ●●●●● ●●●●● ●●●●● ●●●●● ●●● ●● ● ●● ●●● ●●●● ● ●●●●●●●● ●●Aspirochlorine ●●● ●● ●●● ●● ●● ●●● Hexadehydro−astechrome (HAS) ●●●●●●●●● ●●●●●●● ●●●●●●● ● ●●●●● ●● ●●● ● ●●●● ●●● ●●●●● ● ●●●●● ● ● ● ●●●●●●●● ● ● ●● ●●●●●●● ●●●● ●● ●●●●● ●● ●●●●●● Penicillin● ●●●●●●● ● ●● ●●● ●● ●●Cytochalasin● ●● ●●● ● ●● ●●●●●●●● ● ● ●●●●●● ● ●●●●● ● ● Cyclopiazonic acid ●● ● ●● ●● ● ●●● ●●●●● Aspyridone Chaetoglobosin ●● ●●● ●● ● ●●● IsoflavipucinePseurotin A ● ●●●●● ●●●● ● ●●●● ●● ● Pyranonigrin E ● ●●● ●●● ●●●●● ● ●●●● ●●●● ● ●●●●● ●● ●● ●●●●●● ●●●● ● ●●●● ● ●●●● ●●●●●● ● ● ● ●● ● ●●●●● ●●● ● ●● ●●●●●● ●● ●●●●●●●● ●● ●●●●●● ●●●●●●● ●●●●●●●● ●● ● ●●●● ● ●●● ●●●● ● ●●● ● ●● ● ● ● ●● ●●●●● ● ● ●●●● ●●●●●● ● ● ●●●●●● ●●●●●●● ● ●● ●●●●● ● ● ●●●● ● ● ● ●●● ● ●●● ●●● ● ● ●●● ●●● ● ●●●●●● ●● ● ● ●●●● ● ● ● ● ●●●● ●● ●●●● ●●●● ●●●● ●● ●●●● ●● ● ●●●●●● ● ●● ● Terrequinone ●● ●●● ●●● ●● ●● ●●● ●●● ●●●●● ● 0.5

Figure 4 ML phylogeny of A domains. Tip colors indicate SM protein type; Tip labels show associated compound (if applicable). The phylogenetic tree shows that hybrids are monophyletic when compared to NRPSs and NRPS-likes.

105 Theobald et al. Page 13 of 14

● ● ● ● ● ● ● ● ● 59 ● ● 71 ● ● ● ● ● ● ● ● 93 ● Cytochalasin ● 80 ● ● ● ● ● ● Section ● ● ● ● ● Candidi ● ● ● ● Circumdati ● ● ● ● ● Clavati ● ● ● ● ● Flavi ● ● ● ● Fumigati ● ● ● ● ● Nidulantes ● ● ● ● ● Niger_biseriates ● ● ● ● Niger_uniseriates ● ● ● ● ● Ochraceorosei ● ● ● ● Penicillium ● ● ● ● Terrei ● ● ● ● ● ● ● ● SM type ● 64 ● ● Cyclopiazonic acid ● HYBRID ● ● ● NRPS ● ● ● NRPS−Like ● ● ● ● NRPS−Like, NRPS ● ● ● ● 51 ● ● Aspyridone Chaetoglobosin ● ● ● ● 89 ● ● ● ● ● ● ● 77 ● ● ● ● ● ● ● 90 ● ● ● ● ● ● ● 90 ● Pseurotin A ● ● ● ● Isoflavipucine ●

82

5 6 7 8 9 10

Figure 5 Subtree of ML phylogeny of A domains Tip color indicates section/subgroup; tip shape indicates SM protein type; tip label shows associated compound (if applicable). A group of NRPS and NRPS-likes from uniseriate Nigri species forms a sister clade to the monophyletic hybrids. We only find two NRPSs of Fumigati species and one of A. steynii (section Circumdati) inside hybrid groups.

106 Theobald et al. Page 14 of 14

Fusaridione A ● ●● Equisetin ●●●● ● Aspyridone ● Cyclopiazonic acid ● ● ● ●● ● ● ● ● ● Fumosorinone ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ●●● ●●● ● ● ● ●● ● ●●●●● ●●● ● ● ● ●● ● ● ● Chaetoglobosin ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●●● ●●●●●●●● ●● ● ●●●●● ● ● ● Pseurotin A ● ●● ● ● ●● ●● ●●● ●● ●●●●● ● ●● ● Isoflavipucine ●●●●●●●●●● ● ● ● ●● ●●● ●●● Cytochalasin ●●●●● ● ● ● ●●● ● ● syn2 ● ●●●● ●● ● ●●● ●● ● ● ● ● ● ●●● ●● ●●● ●● ● ● ●● ● ● Xenolozoyenone ● ● ● ● ●●● ●●●●●●● ● Genus/ Class SM type ●●●●● ● ●●●● ● ●● ● Aspergillus Hybrid ● Pyranonigrin E

● NRPS Dothideomycetes ●● ● ● Eurotiomycetes ● ● ● ● Exobasidiomycetes ● ● ● Leotiomycetes ● ● Thanamycin ● ● Orbiliomycetes ●● Thanamycin ● ● ● ● ● ● Penicillium ● ● Planctomycetes ● Proteobacteria ● ● ● Sordariomycetes ● Thanamycin ● ● ●●● ● ● ● Terrabacteria group ● ● Thanamycin ●● ● Thanamycin ●●●●●● ● Undefined ● ●●● ● Xylonomycetes NRPS-PKS hybrids Bacterial sequences

Figure 6 ML phylogeny of A domains from PKS-NRPS and NRPS-PKS hybrid genes and their best blast hits from the NCBI non-redundant protein database. Protein sequences were aligned with clustal omega, trimmed and filtered using trimal, and a ML phylogeny generated using iqtree. Compounds of hybrids, if present in MIBiG are shown above tips. Thanamycin is shown multiple times because each domain of the lipopeptide synthetase is annotated. Tip colors indicate whether the hybrid domain is derived from the genera Aspergillus, Penicillium or a different taxonomical class.

107 108 Chapter 5

Comparative genomics of A. nidulans and section Nidulantes

Aspergillus nidulans has long served as the sole representative of its section in comparative genomics studies. In this study we are de novo sequencing section Nidulantes and using our established pipeline to investigate whether proteins and secondary metabolites of A. nidulans are generally present in the whole section. This is an important step in our effort to sequence the whole genus, since it will provide an estimate of how applicable the con- clusions of a model organism are to the whole section. Furthermore, we investigate general regulators and their distribution throughout the section to get insights into fungal speciation. Copy number variations as well as gene loss of central regulators and developmental genes has been observed. We also find differences in central pathways like pigmentation. The chapter will give an insight into the distribution of investigated proteins throughout sec- tion Nidulantes and other Aspergillus species and identify whether proteins are essential or can be replaced by paralogs. Additionally, we will investigate the dynamics of secondary metabolite gene clusters and potential horizontal gene transfers.

109 1 63 2 Comparative genomics of Aspergillus nidulans and 64 3 65 4 section Nidulantes 66

5 a a a a b b 67 6 Sebastian Theobald , Tammi Vesth , Jane Lind Nybo , Inge Kjæbølling , Robert Riley , Asaf Salamov , Ellen Kirstine 68 Kyhnea, Martin Engelhard Koglea, Jens Christian Frisvada, Jakob Blæsbjerg Hoofa, Uffe H. Mortensena, Paul Dyerc, Michelle 7 e a d a,1 69 8 Momany , Thomas Ostenfeld Larsen , Scott Baker , and Mikael Rørdam Andersen 70

9 aDepartment of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark; bDepartment of Energy Joint Genome Institute, Walnut 71 10 Creek, CA 94598; cSchool of Life Sciences, University of Nottingham, Nottingham, UK NG7 2RD; dEarth and Biological Sciences Directorate, Pacific Northwest National 72 e 11 Laboratory, Richland, USA; Fungal Biology Group & Plant Biology Department, University of Georgia,Athens, Georgia, USA 30602 73 12 This manuscript was compiled on February 1, 2018 74 13 75 14 Aspergillus nidulans is an important model organism for eukaryotic highlighted similarities between these fungi, but have not 76 15 biology and has served as the reference for the section Nidulantes in covered sufficient species to proof the general applicability 77 16 comparative genomics studies. In this study, we de novo sequenced of A. nidulans as a model. We are the first ones to set A. 78 17 25 species. Whole-genome phylogeny of 34 Aspergillus species and nidulans into genomic context to its close relatives of section 79 18 Penicillium chrysogenum clarifies the position of clades inside sec- Nidulantes. 80 19 tion Nidulantes. Comparative genomics reveals a high genetic di- In this study, we investigate the dynamics of selected ho- 81 20 versity between species with 684 up to 2433 unique protein families. mologous protein families throughout 34 Aspergillus (and P. 82 21 Nonetheless, protein families for investigated general regulators and chrysogenum) species to assess the application of A. nidulans 83 22 developmental proteins are conserved. Furthermore, we categorized as a model organism for these species. We provide further 84 23 2118 secondary metabolite gene clusters (SMGC) into 603 families insights into the phylogeny of section Nidulantes using whole- 85 24 across Aspergilli, with at least 40% of the families shared between genome-phylogenetic approaches and investigate the diversity 86 25 Nidulantes species. Genetic dereplication of SMGC and subsequent of secondary metabolite gene clusters (SMGCs). 87 26 synteny analysis provides evidence for horizontal gene transfer of 88 27 a SMGC. Our analysis shows that proteins investigated in A. nidu- Results 89 28 lans as well as its SMGC families are generally present in the section 90 Whole genome phylogeny of section Nidulantes. Phyloge- 29 Nidulantes, supporting its role as model organism. 91 30 netic analysis provides insight into the taxonomy of species. 92 Aspergillus Comparative genomics Secondary metabolism Multiple genes were used to improve taxonomy of section 31 | | 93 32 Nidulantes but still disagreed on section boundaries (8, 22). 94 33 With the increasing number of genome sequences available, 95 esearch on Aspergillus nidulans has benefited the scientific 34 we sought to improve the section definitions and species rela- 96 community in many ways. Its use as a model organism 35 R tionships using whole-genome phylogenetic methods. 97 for eukaryotic biology has provided insights into cell polarity 36 We constructed a whole genome phylogeny of the dataset 98 and cell cycle mechanisms (1), DNA repair mechanisms (2), 37 using 200 concatenated genes to increase the resolution of the 99 morphogenesis (3) and cytosceleton development (4). In addi- 38 maximum likelihood (ML) phylogeny of section Nidulantes and 100 tion, studies on A. nidulans has added important information 39 confirm relationships among clades. The ML phylogeny shows 101 to the understanding of antifungal drug resistance (5–7). 40 high bootstrap support for most species and is hence providing 102 41 A. nidulans is part of section Nidulantes which has been consistent results based on our gene selection. Phylogenetic 103 42 synonymized with section Versicolores and Aenei — defining distance was used to establish clades among clustering species 104 them as subclades of section Nidulantes (8).DRAFT Members of the and bootstraps were used to support the hypothesis of clade 43 105 44 section are mainly decomposers of plant material, but also separation (Fig.1). 106 45 include species associated with human infections in the case Using the ML tree we can identify several clades within 107 46 of A. unguis and A. versicolor — the latter affecting indoor section Nidulantes. The A. nidulans clade (clade I) shows 108 47 environments (9) while also being used as source for xylanases a clear distinction from other species of section Nidulantes 109 48 (10). The section also includes the coral pathogen (11) A. and can be further subdivided into three groups. Species 110 49 sydowii. around A. nidulans are forming one subgroup. The residual 111 50 Characteristic to species of the section are, beside the species of clade I are included in a second subgroup with only 112 51 morphological characters, their secondary metabolites (SMs). A. multicolor forming a third group. Our method resolves 113 52 Sterigmatocystin — a carcinogen (12) and common food con- all branches with high support, except for internal branches 114 53 taminant (13) — although produced throughout many sub- for A. indicus, A. similis and A. navahoensis which shows 115 54 genera, is mostly present in Aspergilli of section Nidulantes bootstrap support of 79% and 76%, respectively. Following 116 55 and Circumdati (14). Besides mycotoxins, some species also is A. unguis with 68%. The A. stella-maris clade (clade II) 117 56 produce medically relevant SMs, as e.g. penicillin (15–17). shows clear separation from other species. A. undulatus forms 118 57 Large interest in SMs has driven the investigation of secondary the outgroup of this clade. The A. versicolor clade (clade 119 58 metabolite gene clusters (SMGCs) (18–20). III) is placed as a sister clade to the A. aurantiobrunneus 120 59 Although A. nidulans has been used extensivly as a model 121

60 organism the question of whether it is an appropriate model The authors declare no conflict of interest. 122 61 has remained to be investigated. Sequencing of A. nidulans 123 62 and comparisons to A. fumigatus and A. oryzae (21) have 2To whom correspondence should be addressed. E-mail: mrbio.dtu.dk 124 110 www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX PNAS | February 1, 2018 | vol. XXX | no. XX | 1–11 125 and A. varians. The most phylogenetically distant clade is This indicates that the analyzed species in section Nidulantes 187 126 the A. aeneus clade (clade VI) with species A. spectabilis, A. are overall more conserved as a section than section Nigri. 188 127 crustosus and A. karnatakaensis. The separation of section 189 128 Nigri and Nidulantes only shows 79% bootstrap support. Dynamics of general regulators throughout Aspergilli. The 190 129 Categorizations by Hubka et al (2016) (8) and Chen et al discovery of general regulators in A. nidulans led to a better 191 130 (2016) (22) using genetic methods and a polyphasic approach understanding of e.g. carbon catabolite repression through 192 131 identified similar clades of section Nidulantes. They used CreA (28), or pH response through PacC (29), or regulation of 193 132 (coding regions of) benA, the calmodulin gene (caM), RNA secondary metabolism LaeA (30) (among others). Using the 194 133 polymerase II gene (RPB2 ) and internal transcribed spacer protein families described above, we investigated the diversity 195 134 (ITS) sequences to establish a phylogeny of species using ML of general regulators and whether they are part of the core or 196 135 phylogeny. accessory proteome. 197 136 Our method shows that whole-genome phylogenies can We identified protein families with one protein per species 198 137 be used to separate section Nidulantes into clades of closely for CreA, (31); PacC, a regulator of transcription responding 199 138 related species with high bootstrap support. The phylogenetic to alkaline pH (29); and McrA, a master regulator of SM (32). 200 139 distance of A. undulatus in regard to clade II is relatively large LaeA, another regulator of SM (30), is found in a core family 201 140 which suggests a separation into its own clade. Furthermore, with an additional copy in A. unguis, A. foveolata and A. 202 141 bootstrap values under 80% could be a result of the increased aculeatinus. ML phylogeny of predicted LaeA homologs shows 203 142 resolution of our phylogeny. Our results are in accordance that the second copy in A. foveolata is from clade II (Fig. 204 143 with previous studies based on single gene trees regarding the S6). A. aculeatinus seems to contain a laeA homolog from 205 144 general organization of section Nidulantes. A. terreus and the second copy in A. unguis has no apparent 206 145 close relative in this set of species. 207 146 Genome statistics. Genome sizes in the species of Nidulantes The regulator GalR is unique to A. nidulans according to 208 147 range from 26.1 to 38.7 Mb, with a GC content of 45.5–50.8%. (33). We identified a protein family containing both GalR 209 148 The number of predicted proteins range from 8924 to 13620. and AraR (Fig. S5). ML phylogeny of homologs shows AraR 210 149 Clade IV tends to have larger genomes than other Nidulantes and GalR in separate branches in section Nidulantes and A. 211 150 species (37.5 Mb on average), while A. unguis shows a smaller ochraceroseus with high bootstrap support. Thus GalR is 212 151 genome size (26.1 Mb). These statistics are similar to the present in all species of section Nidulantes and potentially 213 152 overall Aspergillus genomes reported previously (26). Only in A. ochraceroseus. WscB, a putative stress sensor (34), is 214 153 A. unguis and clade IV deviate from the general Aspergilli unique to species of clade I, while WscA is part of the core 215 154 genome sizes. proteome. Absence of one part of the complex suggests that 216 155 WscB is more variable than WscA. 217 156 Core pan proteome. To investigate section Nidulantes pro- In summary, we are able to find important general regula- 218 157 teome diversity and A. nidulans similarity to other species, tors in the core proteome, emphasizing that investigated pro- 219 158 we generated homologous protein families across all species teins in A. nidulans are conserved throughout species. McrA 220 159 and related them to the phylogenetic tree (Fig.2). in other Aspergillus species is a good target for overexpression 221 160 The protein families found in all members of the Aspergilli studies. Previous studies have shown that McrA overexpres- 222 161 (Aspergillus core proteome) covers 4033 families. Nidulantes sion induces production of secondary metabolites (35). LaeA 223 162 species share 105 protein families (Nidulantes specific families). shows additional copies in some species which, according to 224 163 Remarkably, species up until clade I only share very few core their phylogeny, seem to be paralogs of different species (Fig. 225 164 protein families. For clade I we can identify 39 clade-specific S6). Hence we suggest that LaeA is a specialized rather than 226 165 protein families which increases to 44 for the two clade I a global regulator (30). GalR, which was suggested to be 227 166 subgroups. A. multicolor contains 2411 species-uniqueDRAFT protein unique to A. nidulans (33) is predicted to have homologs in 228 167 families, which is quite surprising considering that we included all Nidulantes species. Whether the A. ochraceroseas copy is 229 168 many species of its own clade. In comparison, clade IV shows a GalR homolog could not be identified on the sole basis of 230 169 approximately 2400 species-unique protein families per species ML phylogeny (Fig. S5). If it is a GalR then this regulator is 231 170 as well — only by including three species from the same clade. distributed through the whole subgenus Nidulantes. 232 171 Thus, emphasizing the distance from A. multicolor to clade 233 172 I. 446 protein families are unique to A. fructiculosus and A. Analysis of proteins involved in polarity, cytoskeleton and 234 173 falconensis suggesting them to be isolates of the same species cell cycle. The highly polar growth of filamentous fungi makes 235 174 — which is further supported by the whole-genome phylogeny them ideal for studies of morphology and its coordination 236 175 (Fig.1). A. nidulans only contains 682 protein families which with the cell cycle. The spatial separation of nuclei and 237 176 are species-unique, suggesting that much of its proteome is well-developed genetic system of A. nidulans has made it 238 177 representative of section Nidulantes. an attractive model for many of these studies (1, 36, 37). 239 178 The protein families of section Nidulantes are distinct from To investigate whether A. nidulans is a good model for 240 179 the ones observed earlier for section Nigri since the number of morphology and cell cycle in other Aspergilli, we identified 241 180 isolates and the phylogenetic distance of species are differnt homologs of key polarity genes including the Rho GTPase 242 181 (27). However, we can show that in both cases species show Cdc42 (modA), its associated guanine nucleotide exchange 243 182 around 1000 species-unique proteins, this similarity drops to factor (GEF) Cdc24 and GTPase-activating protein (GAP) 244 183 under 400 species-unique proteins when isolates of the same Rga1, and the p21-activated kinase Cla4. We also identified 245 184 species are compared. Section Nidulantes tends to share more homologs of cytoskeletal genes including actin, tubulins and 246 185 section-unique proteins than clade-unique proteins. Nigri septins and key cell cycle genes including cyclin dependent 247 186 species tend to share more proteins on the clade-unique level. kinases, wee1 kinase, phosphatases and anaphase promoting 248 111 2 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. 249 311

250 genomeSizeMB proteinCount scaffoldCount gcFraction 312 clade ● A. tetrazonus 32.1 11156 452 50.1 251 ● 313 ● A. aurantiobrunneus A. nidulans 30.5 10680 8 49.3 252 ● A. foveolata 31.8 11115 425 50.4 314 ● 253 ● A. unguis A. acristatulus 32.6 11221 311 50.2 315 ● A. neoechinulatus 30.7 10504 236 49.9 254 ● 32.5 11361 601 50.5 316 ● A. varians 79 A. indicus ● A. similis 33.2 11407 569 50.3 255 ● 317 ● clade_I A. fructiculosus 34.3 11612 279 50.6 256 76 ● A. falconensis 34.1 11645 666 50.7 318 ● 257 ● clade_II A. navahoensis 32.8 11625 732 50.9 319 ● A. recurvatus 30.7 11007 269 51.6 258 ● 29.3 10324 278 51.5 320 ● clade_III A. stercoraria 259 68 ● A. desertorum 29 10111 288 51.5 321 ● ● clade_IV A. multicolor 35.3 12821 113 50.7 260 ● A. unguis 26.1 10397 142 50.5 322 ● 261 ● references A. stella−maris 34 12537 266 50.2 323 ● A. venezuelensis 34.8 12450 715 49.6 262 ● A. filifera 31.5 11686 814 50.8 324 263 ● A. olivicola 33.2 12183 824 50.6 325 ● A. undulatus 32.1 11802 190 50.7 264 ● A. amoeneus 33.4 13460 173 50.2 326 265 ● A. versicolor 33.1 13228 51 50.1 327 ● A. sydowii 34.4 13620 97 50 266 ● A. aurantiobrunneus 31.4 11287 278 49.2 328 267 ● A. varians 29.8 10839 427 50.1 329 ● A. spectabilis 38.7 13927 715 50 268 ● A. crustosus 37.4 12063 462 45.5 330 ● 269 79 A. karnatakaensis 36.3 13540 239 49.9 331 ● A. ochraceoroseus 27.7 8924 34 44.2 270 ● A. niger ATCC 1015 34.9 11910 24 50.3 332 271 ● A. aculeatinus 36.5 12027 121 50.3 333 ● 272 A. terreus 29.3 10406 26 52.7 334 ● A. fumigatus Af293 29.4 9781 8 48.8 273 ● A. clavatus 27.9 9121 143 49.2 335 ● P. chrysogenum 31.3 11396 27 48.6 274 0.01 336 275 337 276 Fig. 1. Dendrogram of species. Whole genome phylogeny constructed from alignment of 200 bidirectional best hits between species using RAxML (23), MAFFT (24), and 338 Gblocks (25). Bootstrap support shown at node if under 80. Tip color indicates clade. Genome statistics shown in right panel. 277 339 278 340 279 341 280 342

281 Accessory Unique Core 343 282 344 283 Tree Proteome 345 731 A. tetrazonus 4,692 734 11,155 56 684 284 4 A. nidulans 4,666 720 10,673 346 285 9 895 A. foveolata 4,658 929 11,115 347 2 971 A. acristatulus 4,681 982 11,220 722 A. neoechinulatus 4,636 722 10,503 286 7 348 902 A. indicus 4,700 907 11,360 24 287 1055 A. similis 4,708 1,067 11,406 349 890 A. fructiculosus 4,683 897 11,612 288 44 446 350 803 4 A. falconensis 4,678 807 11,645 289 17 1540 A. navahoensis 4,652 1,560 11,625 351 1092 A. recurvatus 4,633 1,100 11,007 39 1 290 1139 A. stercoraria 4,563 1,172 10,323 352 114 4 1086 DRAFTA. desertorum 4,563 1,098 10,111 291 353 2411 A. multicolor 4,659 2,418 12,821 292 1296 A. unguis 4,559 1,306 10,396 354

0 1368 A. stella−maris 4,700 1,372 12,536 293 193 355 1420 131 A. venezuelensis 4,683 1,424 12,449 294 119 1242 A. filifera 4,649 1,258 11,685 356 54 1607 A. olivicola 4,667 1,612 12,182

295 8 1939 A. undulatus 4,656 1,940 11,802 357 1165 A. amoeneus 4,942 1,168 13,460 296 355 358 1002 588 A. versicolor 4,944 1,004 13,227 297 25 1446 A. sydowii 4,963 1,476 13,620 359 105 298 27 1681 A. aurantiobrunneus 4,595 1,737 11,287 360 1587 A. varians 4,599 1,596 10,839 299 2486 A. spectabilis 4,859 2,541 13,926 361 85 104 2043 120 A. crustosus 4,639 2,073 12,061 300 2433 A. karnatakaensis 4,748 2,441 13,540 362 17 301 1633 A. ochraceoroseus 4,362 1,675 8,924 363 26 2824 A. niger ATCC 1015 4,729 2,859 11,909 315 302 3074 A. aculeatinus 4,695 3,099 12,027 364 303 4033 2123 A. terreus 4,658 2,145 10,406 365 1345 A. fumigatus Af293 4,554 1,399 9,780 386 304 1150 A. clavatus 4,473 1,151 9,121 366 305 0.0 2.5 5.0 7.5 0 5000 10000 15000 367 306 368 Fig. 2. Core, accessory and unique proteins through Aspergillus species. Protein families were built using single linkage on bidirectional protein blast hits with a percent identity 307 of at least 50% and sum of coverage (query and subject) of at least 130% . (Left) Counts on nodes show unique protein families for species included in the branch; counts on 369 308 tips show unique protein families for individual organisms. (Right) Core, unique and accessory proteins represented in stacked bars per species. 370 309 371 310 372 112 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 3 A. tetrazonus ● ● ● ● ● ● ● ● ● ● 373 Homolog count A. nidulans ● ● ● ● ● ● ● ● ● ● 435 374 ● 0 A. foveolata ● ● ● ● ● ● ● ● ● ● 436 A. acristatulus ● ● ● ● ● ● ● ● ● ● 375 ● 1 A. neoechinulatus ● ● ● ● ● ● ● ● ● ● 437 A. indicus ● ● ● ● ● ● ● ● ● ● 376 ● 2 438 A. similis ● ● ● ● ● ● ● ● ● ● 377 A. fructiculosus ● ● ● ● ● ● ● ● ● ● 439 A. falconensis ● ● ● ● ● ● ● ● ● ● 378 A. navahoensis ● ● ● ● ● ● ● ● ● ● 440 379 A. recurvatus ● ● ● ● ● ● ● ● ● ● 441 A. stercoraria ● ● ● ● ● ● ● ● ● ● 380 A. desertorum ● ● ● ● ● ● ● ● ● ● 442 A. multicolor ● ● ● ● ● ● ● ● ● ● 381 A. unguis ● ● ● ● ● ● ● ● ● ● 443 382 A. stella−maris ● ● ● ● ● ● ● ● ● ● 444 A. venezuelensis ● ● ● ● ● ● ● ● ● ● 383 A. filifera ● ● ● ● ● ● ● ● ● ● 445 384 A. olivicola ● ● ● ● ● ● ● ● ● ● 446 A. undulatus ● ● ● ● ● ● ● ● ● ● 385 A. amoeneus ● ● ● ● ● ● ● ● ● ● 447 A. versicolor ● ● ● ● ● ● ● ● ● ● 386 A. sydowii ● ● ● ● ● ● ● ● ● ● 448 387 A. aurantiobrunneus ● ● ● ● ● ● ● ● ● ● 449 A. varians ● ● ● ● ● ● ● ● ● ● 388 A. spectabilis ● ● ● ● ● ● ● ● ● ● 450 A. crustosus ● ● ● ● ● ● ● ● ● ● 389 A. karnatakaensis ● ● ● ● ● ● ● ● ● ● 451 390 A. ochraceoroseus ● ● ● ● ● ● ● ● ● ● 452 A. niger ATCC 1015 ● ● ● ● ● ● ● ● ● ● 391 A. aculeatinus ● ● ● ● ● ● ● ● ● ● 453 392 A. terreus ● ● ● ● ● ● ● ● ● ● 454 A. fumigatus Af293 ● ● ● ● ● ● ● ● ● ● 393 A. clavatus ● ● ● ● ● ● ● ● ● ● 455 394 456 veA brlA laeA creA wetA mcrA pacC abaA MAT2 395 MAT1 457

396 Fig. 3. Selection of protein families with known examples in dataset. Protein families are shown as columns with a dots indicating copy number by color. Column labels 458 397 correspond to annotated A. nidulans genes. 459 398 460 399 461 400 complex (APC) components (Fig.4). As expected, multiple examined showing the gain of 1-3 homologs. Complete loss 462 401 Cdc42 homologs were present in all Aspergilli examined and of all homologs of a specific gene was less common, occurring 463 402 included Cdc42, RacA, Rho4, and RhoA along with 2-3 other in 16/34 species examined, with the nonessential septin AspE 464 403 uncharacterized Rho GTPases. Single homologs of Rga1 and accounting for half of the cases. Loss of a gene family was 465 404 Cla4 were also present in all species examined. While most most common in clades III and IV. A. nidulans contained 466 405 of the species examined had a single Cdc24 GEF homologue, representatives of each polarity, cytoskeletal, or cell cycle gene 467 406 five of the clade III and clade IV Aspergillus species lacked a examined and so in terms of gene content is a good model for 468 407 Cdc24 GEF. This is surprising because Cdc24 is an essential all of the Aspergillus species examined. 469 408 gene in A. nidulans (38). The lack of Cdc24 homologs in 470 409 five of 34 species examined, along with conservation of other Analysis of CAZymes. Carbohydrate active enzymes 471 410 key polarity genes, suggests that a more highly diverged gene (CAZymes) determine the plant biomass degrading potential 472 411 might encode a GEF for Cdc42 in these species. of fungi. Aspergilli contain many of these CAZymes which 473 412 Single homologs of polarisome components SpaA, BudA, enables them to degrade numerous polysaccharides (46). 474 413 BemA and Ste20 were present in all species, while four lacked Throughout clades, the Aspergillus species of section Nidu- 475 414 the formin SepA (39). All species containedDRAFT single or mul- lantes show generally the same diversity of CAZymes while 476 415 tiple homologs of cytoskeletal elements actin, actin related differing in the amounts of specific classes5. In the auxiallary 477 416 protein Arp1 (nudK) and tubulins. Alpha and beta tubulins category aryl alcohol oxidase and glucose 1-oxidase (AA3_2) 478 417 ranged from 1-3 homologs per species, consistent with pre- are the predominant classes, followed by copper-dependent 479 418 viously reported presence of tubulins specialized for specific lytic polysaccharide monooxygenases (AA9). Carbohydrate 480 419 developmental states (40, 41) The core septins AspACDC11, binding modules (CBM) show two dominant classes in Nidu- 481 420 AspBCDC3, AspCCDC12, AspDCDC10 had single homologs in lantes: CBM50, CBM18; a contrast to Aspergillus references 482 421 each species, consistent with the assembly of these core septins where CBM1 is the most abundant class, followed by CB50 and 483 422 into complexes (42). Strikingly the noncore septin AspE was CBM18. Glycosylhydrolases (GH) GH3, GH43 and GH18 are 484 423 absent in 11 of the 34 species examined. This is similar to the most common ones in Nidulantes species, except for A. au- 485 424 the patchy distribution of this nonessential septin across fungi rantiobrunneus which contains slightly more GH16 than GH18. 486 425 and across kingdoms (42–44). All of the key cell cycle genes GH28 is increased in Aspergillus references compared to Nidu- 487 426 examined had at least one homologue present in all species lantes. Glycosyltransferases (GT) and polysaccharide lyases 488 427 with the exception of the wee1 kinase AnkA which was absent (PL) show similar abundance patterns through Aspergillus 489 428 in two clade I species. AnkA is an essential gene in A. nidulans species. The varying amounts of CAZyme classes indicates 490 429 (45), so its absence in some species is surprising and once more the specialization of clades towards certain polysaccharides. 491 430 suggests that a gene too highly diverged to be detected might 492 431 play the same role in these species. Secondary metabolite gene cluster diversity confirms clade 493 432 In terms of general trends, we found that extra homologs concepts. Aspergilli show a vast diversity of secondary metabo- 494 433 of polarity, cytoskeletal and cell cycle genes were common lites (SMs), with dynamics in secondary metabolite gene clus- 495 434 across the Aspergillus species examined, with all 34 species ters (SMGCs) throughout species leading to new compounds. 496 113 4 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. 497 559 498 560 499 561 500 562 501 563 502 564 503 565 504 566 505 567 506 568 507 569 508 570 509 571 510 A. tetrazonus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 572 Homolog count 511 A. nidulans ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 573 ● 0 A. foveolata ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 512 A. acristatulus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 574 513 ● 1 A. neoechinulatus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 575 A. indicus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 514 ● 2 576 A. similis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 515 ● 3 A. fructiculosus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 577 516 A. falconensis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 578 ● 4 A. navahoensis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 517 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 579 ● 6 A. recurvatus 518 A. stercoraria ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 580 519 ● 7 A. desertorum ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 581 A. multicolor ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 520 A. unguis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 582 521 A. stella−maris ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 583 522 A. venezuelensis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 584 A. filifera ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 523 A. olivicola ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 585 524 A. undulatus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 586 525 A. amoeneus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 587 A. versicolor ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 526 A. sydowii ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 588 527 A. aurantiobrunneus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 589 A. varians ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 528 A. spectabilis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 590 529 A. crustosus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 591 530 A. karnatakaensis ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 592 A. ochraceoroseus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 531 A. niger ATCC 1015 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 593 532 A. aculeatinus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 594 533 A. terreus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 595 A. fumigatus Af293 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 534 A. clavatus ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 596 535 597 cla4

536 nimT 598 mipA nimE nimA bimE aspA aspB aspE ankA sepA spaA budA aspC aspD ste20 bemA RGA1

537 CDC24 599 tubA,tubB actA,nudK 538 DRAFTbenA,tubC 600 539 601

540 nimX,phoA,phoB 602 541 603 rho4,racA,modA,rhoA 542 604 543 Fig. 4. Distribution of polarity, cytoskeleton and cell cycle proteins throughout the dataset. Protein families for A. nidulans proteins (one family per column) involved in polarity, 605 544 cytoskeleton and cell cycle were extracted from protein families and annotations concatenated if multiple annotated proteins were found per family (annotated on the x-axis). 606 Homologs are shown for all species of the dataset. Number of homologs shown by point color. 545 607 546 608 547 609 548 610 549 611 550 612 551 613 552 614 553 615 554 616 555 617 556 618 557 619 558 620 114 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 5 621 683 622 684 623 685 624 686 625 687 626 688 627 689 628 690 629 691 630 692 631 693 632 694 633 695 634 696 Average CAZyme frequencies for groups PL

635 AA 697 CE GT GH CBM EXPN 636 698

20 637 clade_I 699 638 10 700 639 0 701

640 20 A. unguis 702

641 10 703

642 0 704 643 705 20 644 clade_II 706 645 10 707 646 0 708

20 647 clade_III 709 Groups clade_I 648 10 710 A. unguis clade_II 649 0 711 clade_III A. aurantiobrunneus A. aurantiobrunneus

650 Protein count A. varians 712 20 Aspergillus_references 651 P. chrysogenum 713 652 10 714 653 0 715

654 20 A. varians 716

655 10 717

656 0 718 657 Aspergillus_references 719 20 658 720 10 659 721 660 0 722 P. chrysogenum 661 20 723 662 10 724

0 DRAFT 663 725 PL1 AA9 AA1 AA7 AA4 AA8 AA3 AA6 AA2 CE4 CE5 CE1 CE3 CE8 CE9 CE2 GT2 GT1 GT4 GT8 GT5 GT3 GH3 GH2 GH1 GH7 GH6 GH4 PL20 PL26 PL27 AA11 AA13 CE16 CE12 CE15 GT32 GT31 GT20 GT22 GT71 GT34 GT25 GT15 GT62 GT39 GT90 GT69 GT57 GT61 GT76 GT48 GT17 GT21 GT24 GT33 GT35 GT41 GT50 GT58 GT59 GT66 GT91 GH43 GH18 GH16 GH31 GH28 GH78 GH76 GH47 GH71 GH72 GH17 GH92 GH35 GH55 GH36 GH32 GH27 GH95 GH51 GH88 GH10 GH12 GH93 GH15 GH20 GH64 GH79 GH11 GH75 GH25 GH24 GH33 GH39 GH62 GH29 GH74 GH53 GH45 GH26 GH67 GH37 GH19 GH23 GH38 GH42 GH49 GH54 GH63 GH65 GH81 GH84 GH89 EXPN CBM1 CBM6 PL1_4 PL3_2 PL1_7 PL4_1 PL1_2 PL4_3 PL4_5 PL7_4 PL9_3 AA3_2 AA1_3 AA1_2 AA3_1 AA5_2 AA3_3 AA3_4 GH5_7 GH105 GH135 GH115 GH5_9 GH132 GH5_5 GH114 GH134 GH128 GH140 GH127 GH131 GH106 GH125 GH130 GH133 GH136 GH139 GH142 GH145 GH5_4 CBM50 CBM18 CBM43 CBM35 CBM48 CBM20 CBM32 CBM24 CBM67 CBM14 CBM38 CBM13 CBM19 CBM21 CBM42 CBM52 CBM63 CBM66 PL11_2 AA2_cyt GH13_1 GH5_23 GH13_5 GH5_15 GH5_22 GH5_12 GH5_16 GH13_8 GH30_3 GH30_7 GH5_27 GH5_31 GH5_49

664 GH13_40 GH13_22 GH13_25 726 665 CAZyme category 727 666 Fig. 5. Analysis of CAZymes. Bars show the average number of proteins in each clade (references split up and averaged). Group indicated by color. Top panels show categories: 728 667 Auxiliary activities (AA), carbohydrate-binding molecules (CBM), carbohydrate esterases (CE), Distant plant expainins (EXPN), glycoside hydrolases (GH), Glycosyltransferases 729 668 (GT), polysaccharide lyases (PL) 730 669 731 670 732 671 733 672 734 673 735 674 736 675 737 676 738 677 739 678 740 679 741 680 742 681 743 682 744 115 6 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. 745 Hence, it was an obvious step to aggregate SMGCs into ho- duce related compounds we annotated the SMGC family in 807 746 mologous families to describe the variety of SMGCs through a guilt by association approach — a process we term genetic 808 747 species in section Nidulantes and investigate whether they dereplication. 809 748 further confirm the assignment of species into clades. Our analysis revealed several conserved clusters. The sterig- 810 749 We compared all SMGC members and calculated a similar- matocystin, fellutamide B (51), emericellamide, YWA, en- 811 750 ity score for each SMGC pair. Using random walk clustering docrocin (52), emericellin as well as the asperthecin SMGC 812 751 on resulting similarity networks of SMGCs categorized 2118 (53) have related gene clusters in almost all species of section 813 752 SMGC into 603 families. Of these families 433 are unique to Nidulantes. Thus, they are characteristic for the section. The 814 753 a species, 175 of the unique gene clusters are belonging to alternatiol, cichorine, F9775, asperfuranone, azanigerone and 815 754 references (Supplementary Fig. 1). 29 SMGC families are terrequinone cluster homologs are present in many clade I 816 755 shared between two species, while 89 are shared between 3-10 members and scattered over clade III and IV. This pattern 817 756 species. Identifying the shared secondary metabolite gene indicates that the gene clusters have been acquired before 818 757 cluster (SMGC) families throughout the section Nidulantes, differentiation of the Nidulantes section, then lost in some 819 758 we can determine that species share 40-50% of their SMGC species during their development. 820 759 families (Fig.6). Clustering the species by their shared SMGC 821 760 content, we can identify subgroups inside the section which are Apart from the conserved gene clusters in section Nidu- 822 761 to a large extent in agreement with clades based on phylogeny. lantes, we identified SMGCs that are only present in a few 823 762 Species inside clade I share at least 50% of their SMGC fam- species and suggest recent acquisition by horizontal gene trans- 824 763 ilies, while many members in subgroup I share 70-80% SMGC fer. A gene cluster in A. multicolor shows high synteny con- 825 764 families, meaning that their repertoire of SMs is largely the servation to the hexadehydro-astechrome (HAS) gene cluster 826 765 same. A. fructiculosus and A. falconensis in subgroup II share of A. fumigatus, whose product increases virulence (54), and 827 766 between 90-100% of their SMGC families, an amount com- to the astechrome gene cluster in A. terreus and8. The lat- 828 767 monly shared by isolates of the same species. A. multicolor ter cluster only produces astechrome since it misses a flavin 829 768 shares the lowest amount of SMGC with the whole its clade adenine dinucleotide (FAD) binding enzyme (54, 55). The A. 830 769 (40-60%). A. unguis shares mostly 40-50% SMGC with clade multicolor cluster, however, contains the FAD gene, suggesting 831 770 I and 30-40% with the rest of the section — confirming that it to produce HAS. To confirm the hypothesis of HGT, we 832 771 this species forms its own clade. Species in clade III show a identified a syntenic site to the area surrounding the HAS- 833 772 high conservation of their SMGC content with 70-80% SMGC homolog-locus of A. multicolor in A. nidulans (Fig.8). We 834 773 families shared between species. An exception is A. undula- sustained our hypothesis of HGT comparing a ML phylogeny 835 774 tus, a member of clade III, which only shares 50-60% SMGC of HAS homologs against homologs of the conserved NRPS, 836 775 families with the other members of it’s own clade — also indi- SidC. Homologs of SidC in A. multicolor, A. fumigatus and A. 837 776 cating it as its own clade. The SMGC similarity drops with A. terreus show a greater distance than their HAS homologs (Fig. 838 777 aurantiobrunneus, A. varians and clade IV — only showing 8). Furthermore, the distances shown for their HAS homologs 839 778 low conservation in SMGC families. This suggests a different resemble that of conserved NRPSs inside the same section, 840 779 SM content in these species than in other Nidulantes species. suggesting recent aquisition of the HAS cluster. Thus, we show 841 780 In summary, we can further sustain the concept of section evidence for a HGT of the HAS NRPS from A. fumigatus to 842 781 Nidulantes using SMGC content. Most comparisons indicate A. multicolor. 843 782 a SMGC similarity of 40 percent and higher. Remarkably, A related SMGC to the penicillin gene cluster can be found 844 783 clade III and clade IV share more SMGC families with species in several species of clade I and III, as well as P. chrysogenum 845 784 of clade I than between each other, suggesting that deletion (Fig.6). Furthermore, we find related gene clusters for 3,5- 846 785 events defined the SMGC content of these clades when they dimethylorsellinic acid based meroterpenoids terretonin (56), a 847 786 diverged from a common ancestor. Our resultsDRAFT point to a mycotoxin, and austinol/dehydroaustinol (57), andrastin A, a 848 787 general conservation of SMGC in section Nidulantes. Previous promising antitumor agent (58), in one family. The gene clus- 849 788 studies in section Nigri showed that species can have drasti- ter in A. stella-maris (supplemental file clusters) resembles the 850 789 cally different SMGC content which was as low as 10–20%. gene cluster structure of the calidodehydroaustin gene cluster 851 790 Here we find species to have at least 20–30% similarity while (59). Six known examples of gene clusters are found through- 852 791 most comparisons are showing at 40–50%. This supports that out Nidulantes, while four are completely absent in clade II and 853 792 clades III and IV are inside section Nidulantes which was still other gene clusters investigated in A. nidulans have scattered 854 793 discussed (8, 47). presence patterns throughout species — as for the aspyridone 855 794 hybrid gene cluster. In summary, we can identify different 856 795 Secondary metabolites in section Nidulantes. Analytical stud- SMGC classes in section Nidulantes than in previous analyses 857 796 ies of Aspergilli have shown a large chemical diversity through of section Nigri. While section Nigri tends to contain more 858 797 sections (48) with only a small fraction of compounds being 6-methylsalicilic acid synthase derived clusters (Yanuthones, 859 798 conserved over large phylogenetic distances. In many cases secalonic acid), section Nidulantes contains more orsellinic 860 799 SMs have been used to identify species (49). Hence, we were acid derived gene clusters (i.e. F9775, dehydroaustinol). The 861 800 interested in the diversity of known secondary metabolite gene predicted SMGC for F9775 in A. karnatakaensis is producing 862 801 clusters throughout section Nidulantes. Aspergillus and Peni- a related compound which is further modified to a karnataka- 863 802 cillium SMGC from the Minimum Information on Biosynthetic furan (60), confirming our guilt by association approach. The 864 803 Gene clusters (MIBiG) database (50) were compared to SMGC detection of the sterigmatocystin gene clusters is in accor- 865 804 of the dataset using protein BLAST. We identified best hits dance with previous reports (14) proving the robustness of our 866 805 using a percent identity, query coverage and subject coverage method. Generally, SMGCs investigated in A. nidulans are 867 806 cutoff of 95%. Since we predict members of a family to pro- found throughout section Nidulantes, emphasizing that they 868 116 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 7 869 931 870 932 871 933 872 934 873 935 874 936 875 937 876 938 877 939 878 940 879 941 880 942 881 943 882 944 883 Color Key 945 884 and Histogram 946 885 947

886 Count 948 887 949

888 0 100 250 950 889 0 20 40 60 80 951 Value 890 952 891 A. tetrazonus 953 A. nidulans A. foveolata 892 A. acristatulus 954 A. neoechinulatus 893 A. indicus 955 A. similis 894 A. fructiculosus 956 A. falconensis 895 A. navahoensis 957 A. recurvatus A. stercoraria 896 A. desertorum 958 A. multicolor 897 A. unguis 959 A. stella−maris 898 A. venezuelensis 960 A. filifera 899 A. olivicola 961 A. undulatus 900 A. amoeneus 962 A. versicolor A. sydowii 901 A. aurantiobrunneus 963 A. varians 902 A. spectabilis 964 A. crustosus 903 A. karnatakaensis 965 A. ochraceoroseus 904 A. niger ATCC 1015 966 A. aculeatinus 905 A. terreus 967 A. fumigatus Af293 A. clavatus 906 P. chrysogenum 968 907 969 Terrein F9775 908 Penicillin 970 TrypacidinSorbicillinAspyridone AlternariolCichorine YanuthoneFungisporin D AzanigeroneTerrequinone A. filifera Fellutamide B Acetylaranotin A. similis

Asperfuranone A. unguis A. indicus Naphthopyrone A. terreus A. varians 909 Echinocandin B A. sydowii 971 A. olivicola A. clavatus

Fumiquinazolines A. nidulans A. foveolata A. versicolor A. multicolor A. crustosus A. undulatus A. tetrazonus A. spectabilis A. amoeneus A. recurvatus A. stercoraria A. falconensis A. aculeatinus A. acristatulus Endocrocin,Pseurotin Emericellin A, Fumagillin A. desertorum Asperthecin, TAN−1612 A. stella − maris A. navahoensis 910 A. fructiculosus 972 P. chrysogenum A. venezuelensis A. neoechinulatus A. karnatakaensis A. ochraceoroseus DRAFTA. fumigatus Af293

911 1015 A. niger ATCC 973 Hexadehydro−astechrome (HAS) A. aurantiobrunneus Terreic acid, 6−methylsalicyclic acid 912 Emericellamide A / emericellamide B 974 Sterigmatocystin, Aflatoxin/sterigmatocystin 913 Austinol/dehydroaustinol, locus 1, Terretonin 975 914 Fig. 6. SMGC diversity throughout the dataset. Left: Annotation of SMGC families with known compounds from the MIBiG database. Right: Comparison of SMGC families 976 915 between species. The heatmap shows the amount of shared SMGC families in percent indicated by cell color. Column dendrogram of organisms is clustered hierarchically 977 916 according to SMGC families. 978 917 979 918 980 919 981 920 982 921 983 922 984 923 985 924 986 925 987 926 988 927 989 928 990 929 991 930 992 117 8 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. 993 are conserved. Some species show known SMGC from species Data collection. Protein, gff, smurf, interpro and go data was col- 1055 994 of other sections, e.g. the hexadehydro-astechrome SMGC or lected from JGI( A customized version of SMURF (63) was used 1056 995 the acetylaranotin. Using our comparative genomics approach to annotate secondary metabolite gene clusters throughout draft 1057 Aspergillus genomes. Protein sequences, smurf, interpro, GO and 996 we can identify horizontal gene transfers of gene clusters faster gff annotations were obtained from jgi (https://genome.jgi.doe.gov/). 1058 997 than on a single case basis (61, 62). 1059 998 Creating protein families.. Protein families were created using single 1060 999 PKS and NRPS show diversity in closely related species. The linkage on bidirectional protein BLAST (64) hits with a percent 1061 identity of at least 50% and sum of coverage (query and subject) of 1000 differences in abundance of known gene clusters led us to at least 130% (27). 1062 1001 investigate how many non-redundant SMGC are added to the 1063 1002 dataset per genome sequenced. Taking A. nidulans SMGC Whole genome phylogeny. Whole genome phylogeny was con- 1064 1003 families as the starting point, we identified the number of structed from alignment of 200 bidirectional best hits between 1065 species using RAxML (23), MAFFT (24), and Gblocks (25). 1004 non-homologous SMGC families by addition of new species 1066 1005 (Fig.7). The plot is divided into clades as defined in figure 1. Creating SMGC families.. SMGC families were created according to 1067 1006 The first group shows the addition of new SMGC families from (27). In brief, bidirectional blast hits between proteins of secondary 1068 1007 reference species — most new SMGC families are added here. metabolite gene clusters were aggregated into a cluster vs cluster 1069 1008 Following this, gene clusters of species from section Nidulantes similarity score network. Subsequently, random walk clustering was 1070 used on this network to create SMGC families (65). 1009 are added. The second group includes clades A. aenei and 1071 1010 A. versicolor. Showing that organisms in the same section Genetic dereplication. MIBiG files were handled using biopython 1072 1011 can still add to the non-redundant repertoire of all classes (66) and subsetted for entries from Aspergillus and Penicillium 1073 1012 of SM genes. The third group consists of SMGC families species. Selected entries were compared to secondary metabolite 1074 1013 proteins of the dataset using protein BLAST (64) and best hits, 1075 from the clade II. PKS and NRPS families increase drastically. which suffice a 95% pident, 95% query coverage and 95% subject 1014 NPRS-Like, PKS-Like, HYBRID, TC, DMATS only increase coverage cutoff, annotated in our dataset. 1076 1015 slightly. The last group consists of clade I species. Only 1077 1016 new NRPS, PKS and PKS-like are added to the set. Hence, ML phylogenies of NRPS. Protein sequences were aligned using 1078 1017 only the most abundant classes show new families in closely clustalo (67), trimmed with trimal (68) using a gapthreshold of 1079 1018 related species. PKS and NRPS gene clusters seem to be the 0.8, a similarity threshold of 0.001, keeping 80% of columns. ML 1080 1019 most abundant and diverse secondary metabolite genes on a phylogenies were created using iqtree (69) with substitution model 1081 1020 section level. This indicates that these classes are subject to chosen according to ModelFinder Plus (70) and 1000 times ultrafast 1082 1021 horizontal gene transfer or, in the case of NRPS, extensive bootstrap (71). 1083 1022 domain rearrangement events. 1084 1. Momany M (2002) Polarity in filamentous fungi: establishment, maintenance and new axes. 1023 Our genetic dereplication analysis showed that elucidated Current opinion in microbiology 5(6):580–5. 1085 1024 gene clusters of A. nidulans are biased towards section Nidu- 2. Goldman GH, Kafer E (2004) Aspergillus nidulans as a model system to characterize the 1086 1025 DNA damage response in eukaryotes. Fungal Genetics and Biology 41(4):428–442. 1087 lantes and clade I. Despite this, it is still informative to mine 3. Virag A, Lee MP, Si H, Harris SD (2007) Regulation of hyphal morphogenesis by cdc42 and 1026 closely related species for new SMGC. As the analysis of non- rac1 homologues in Aspergillus nidulans. Molecular Microbiology 66(6):1579–1596. 1088 1027 redundant SMGC added per genome shows, we can expect 4. Shukla N, Osmani AH, Osmani SA (2017) Microtubules are reversibly depolymerized in re- 1089 sponse to changing gaseous microenvironments within Aspergillus nidulans biofilms. Molec- 1028 that the species sequenced in this study will supply novel ular Biology of the Cell 28(5):634–644. 1090 1029 SMGC. Although most new SMGC are gained from Aspergilli 5. de Waard MA, van Nistelrooy JG (1979) Mechanism of resistance to fenarimol in Aspergillus 1091 1030 of different sections, closely related species still yield a vast nidulans. Pesticide Biochemistry and Physiology 10(2):219–229. 1092 6. Osherov N, Kontoyiannis DP, Romans A, May GS (2001) Resistance to itraconazole in As- 1031 amount of non-redundant PKS and NRPS. pergillus nidulans and Aspergillus fumigatus is conferred by extra copies of the A. nidulans P- 1093 1032 450 14alpha-demethylase gene, pdmA. The Journal of antimicrobial chemotherapy 48(1):75– 1094 1033 81. 1095 Conclusion 7. Rocha EMF, Almeida CB, Martinez-Rossi NM (2002) Identification of genes involved in 1034 terbinafine resistance in Aspergillus nidulans. Letters in Applied Microbiology 35(3):228–232. 1096 DRAFT8. Hubka V, et al. (2016) A reappraisal of Aspergillus section Nidulantes with descriptions of 1035 We de novo sequenced species in section Nidulantes and related 1097 1036 two new sterigmatocystin-producing species. Plant Systematics and Evolution 302(9):1267– 1098 their genetic content to the genome of the model organism A. 1299. 1037 nidulans. Our analysis shows that general regulators charac- 9. Engelhart S, et al. (2002) Occurrence of toxigenic Aspergillus versicolor isolates and sterig- 1099 1038 terized in A. nidulans are distributed throughout its section. matocystin in carpet dust from damp indoor environments. Applied and Environmental Micro- 1100 biology 68(8):3886–3890. 1039 Protein family analysis indicated that in some cases we iden- 10. Carmona EC, et al. (2005) Production, purification and characterization of a minor form of 1101 1040 tified paralogs, pointing to a different function. In the case xylanase from Aspergillus versicolor. Process Biochemistry 40(1):359–364. 1102 1041 of LaeA paralogs this could mean that they function on dif- 11. Alker AP, Smith GW, Kim K (2001) Characterization of Aspergillus sydowii (Thom et Church), 1103 a fungal pathogen of Caribbean sea fan corals. Hydrobiologia 460:105–111. 1042 ferent SMGCs than LaeA. Key cell cycle, cytoskeletal, and 12. Terao K (1983) Sterigmatocystin-a masked potent carcinogenic mycotoxin. Toxin Reviews 1104 1043 polarity genes are present throughout all species confirming 2(1):77–110. 1105 1044 13. Veršilovskis A, de Saeger S (2010) Sterigmatocystin: Occurrence in foodstuffs and analytical 1106 A. nidulans role as model organism. Protein families provide methods - an overview. Molecular Nutrition and Food Research 54(1):136–147. 1045 insights into clade specific adaptations with loss of proteins in 14. Rank C, et al. (2011) Distribution of sterigmatocystin in filamentous fungi. Fungal Biology 1107 1046 clade III and IV. Gene cluster predictions were confirmed by 115(4-5):406–420. 1108 15. Raper KB (1946) the Development of Improved Penicillin-Producing Molds. Annals of the 1047 data on analytical studies in the case of sterigmatocystin and New York Academy of Sciences 48(2):41–56. 1109 1048 karnatakafurans. Hence guilt by association based annotation 16. Holt G, MacDonald KD (1968) Isolation of strains with increased penicillin yield after hybridiza- 1110 1049 of general regulators and SMGCs is a reliable method to newly tion in Aspergillus nidulans. Nature 219(5154):636–637. 1111 17. Diez B, Ii V, Martin JF, Barredosll JL (1990) The Cluster of Penicillin Biosynthetic Genes. 1050 sequenced fungi. Biochemistry 265(27):16358–16365. 1112 1051 18. Nielsen ML, et al. (2011) A genome-wide polyketide synthase deletion library uncovers novel 1113 1052 genetic links to polyketides and meroterpenoids in Aspergillus nidulans. FEMS Microbiol Lett 1114 Materials and Methods 321(2):157–166. 1053 19. Klejnstrup ML, et al. (2012) Genetics of Polyketide Metabolism in Aspergillus nidulans. Vol. 2, 1115 1054 pp. 100–133. 1116 118 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 9 1117 1179 SM type ● PKS● HYBRID● NRPS 1118 1180 A. aurantiobrunneus 1119 start A. varians A. unguis 1181 1120 references clade_IV clade_III clade_II clade_I 1182

● 250 ● 1121 ● ● 1183 ● ● ● ● ● ● ● 1122 ● 1184 ● ● ● ● 1123 ● 1185 200 ● 1124 ● 1186 ● ● ● ● ● ● ● ● ● ● ● 1125 ● ● ● 1187 ● ● ● ● ● ● ● ● 1126 150 ● ● 1188 ● ● ● ● ● 1127 ● 1189 ● ● ● 1128 Count ● 1190 ● ● 100 ● ● 1129 ● 1191 ● ● 1130 ● 1192 ● ● 1131 ● 1193 50 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1132 ● ● ● ● 1194 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1133 ● 1195 ● ● ● ● 1134 0 ● ● 1196 1135 1197 1136 1198 1137 1199 A. terreus 5 A. filifera 19 A. filifera A. similis 30 A. unguis 22 A. clavatus 3 A. clavatus A. indicus 31 A. nidulans 1 A. varians 12 A. varians A. sydowii 14 A. sydowii A. olivicola 18

1138 34 A. foveolata 1200 A. versicolor 15 A. versicolor A. multicolor 23 A. multicolor A. crustosus 10 A. undulatus 17 A. aculeatinus 6 A. aculeatinus A. tetrazonus 35 A. tetrazonus A. spectabilis 11 A. amoeneus 16 A. recurvatus 26 A. stercoraria 25 A. falconensis 28 A. falconensis A. acristatulus 33 A. desertorum 24 A. stella − maris 21 P. chrysogenum 2 A. navahoensis 27 A. navahoensis 1139 A. fructiculosus 29 1201 A. karnatakaensis 9 A. venezuelensis 20 A. venezuelensis A. ochraceoroseus 8 A. ochraceoroseus A. fumigatus Af293 4 A. neoechinulatus 32 A. neoechinulatus A. niger ATCC 1015 7 A. niger ATCC

1140 A. aurantiobrunneus 13 1202 1141 1203 1142 1204 1143 Fig. 7. Non-redundant SMGC added per genome. Scatterplot showing the number of non-redundant SMGC added per genome; starting with A. nidulans SMGC families. The 1205 type of SMGC is color coded. Blue rectangles indicate groups of (from left to right) reference species, A. aenei and A. versicolor clade, A. stellatus clade, and A. nidulans clade. 1144 1206 1145 1207

1146 A. terreus 100% 1208 1147 A 1209 1148 1210 1149 65% 1211 1150 A. fumigatus 1212 1151 1213 1152 1214 1153 1215 1154 A. multicolor 1216 1155 1217 1156 1218 1157 1219 1158 A. nidulans DRAFT 1220 1159 A. aculeatinus 464400 1221 99 A. niger ATCC 1015 1189171 100 A. fumigatus Af293 1586 1160 100 A. terreus 8636 1222 A. fumigatus Af293 4427 A. ochraceoroseus 431713 1161 A. crustosus 397316 1223 100 A. karnatakaensis 213387 1162 BC A. spectabilis 193182 1224 100A. amoeneus 230978 100 A. versicolor 158415 1163 A. sydowii 145445 1225 A. varians 355222 A. aurantiobrunneus 116965 1164 A. terreus 2565 1226 100 A. stella−maris 201771 100 A. venezuelensis 199997 1165 100 100 A. filifera 46477 1227 100 A. olivicola 181633 1166 100 A. undulatus 272819 1228 A. unguis 119415 1167 A. multicolor 175902 1229 100 A. nidulans 6139 sidC A. multicolor 187337 99 A. tetrazonus 196501 1168 100 A. foveolata 26710 1230 A. acristatulus 55872 98 A. neoechinulatus 189164 1169 100 A. indicus 33643 1231 A. similis 123540 1170 0.1 A. desertorum 7520 1232 100A. stercoraria 280911 A. fructiculosus 209547 1171 100A. falconensis 177220 1233 A. recurvatus 184570 1172 A. navahoensis 103224 1234 0.1 1173 1235 1174 Fig. 8. synteny between Aspergilli. A Synteny plot for HAS locus in A. terreus, A. fumigatus and A. multicolor, and best hit locus in A. nidulans. hasA-H (54) are conserved in A. 1236 1175 multicolor, while hasG is missing in A. terreus. Thus we expect A. multicolor to synthesize full HAS (A. terreus only produces astechrome according to Bok et al. (55)). A. 1237 1176 nidulans contains a region which is syntenic to the upstream region of the HAS locus. Hence, we expect that the HGT event occured at Chr VII of A. nidulans. Homologs for 1238 these HAS proteins were only found in the family. B ML phylogeny of HAS and (C) sidC-homolog NRPSs. Protein sequences were aligned using clustal omega and trimmed 1177 using trimal prior to maximum likelihodd analysis using iqtree. HAS homologs (left) show phylogenetic proximity. Comparison to distances of a conserved 1239 1178 1240 119 10 | www.pnas.org/cgi/doi/10.1073/pnas.XXXXXXXXXX Theobald et al. 1241 20. Sanchez JF, et al. (2010) Molecular genetic analysis of the orsellinic acid/F9775 genecluster Degradation of Plant Cell Wall Polysaccharides. Microbiology and Molecular Biology Reviews 1303 1242 of Aspergillus nidulans. Mol. BioSyst. 6(3):587–593. 65(4):497–522. 1304 21. Galagan JE, et al. (2005) Sequencing of Aspergillus nidulans and comparative analysis with 47. Kocsubé S, et al. (2016) Aspergillus is monophyletic: Evidence from multiple gene phyloge- 1243 A. fumigatus and A. oryzae. Nature 438(7071):1105–1115. nies and extrolites profiles. Studies in Mycology 85:199–213. 1305 1244 22. Chen A, et al. (2016) Aspergillus section Nidulantes (formerly Emericella): Polyphasic taxon- 48. Frisvad JC, Larsen TO (2015) Chemodiversity in the genus Aspergillus. Applied Microbiology 1306 1245 omy, chemistry and biology. Studies in Mycology pp. 1–118. and Biotechnology 99(19):7859–7877. 1307 23. Stamatakis A (2014) RAxML version 8: A tool for phylogenetic analysis and post-analysis of 49. Frisvad JC, Andersen B, Thrane U (2008) The use of secondary metabolite profiling in chemo- 1246 large phylogenies. Bioinformatics 30(9):1312–1313. taxonomy of filamentous fungi. Mycological Research 112(2):231–240. 1308 1247 24. Katoh K, Misawa K, Kuma Ki, Miyata T (2002) MAFFT: a novel method for rapid multiple 50. Medema MH, et al. (2015) Minimum Information about a Biosynthetic Gene cluster. Nature 1309 sequence alignment based on fast Fourier transform. Nucleic acids research 30(14):3059– Chemical Biology 11(9):625–631. 1248 3066. 51. Yeh Hh, et al. (2016) Resistance Gene-Guided Genome Mining: Serial Promoter Exchanges 1310 1249 25. Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in in Aspergillus nidulans Reveal the Biosynthetic Pathway for Fellutamide B, a Proteasome 1311 1250 phylogenetic analysis. Molecular biology and evolution 17(4):540–552. Inhibitor. 1312 26. de Vries RP, et al. (2016) Comparative genomics reveals high biological diversity and specific 52. Lim FY, et al. (2012) Genome-based cluster deletion reveals an endocrocin biosynthetic path- 1251 adaptations in the industrially and medically important fungal genus Aspergillus. pp. 1–45. way in Aspergillus fumigatus. Applied and Environmental Microbiology 78(12):4117–4125. 1313 1252 27. Vesth TC, et al. (2018) The genomes of Aspergillus section Nigri reveal drivers in fungal 53. Szewczyk E, et al. (2008) Identification and characterization of the asperthecin gene cluster 1314 1253 speciation (in preparation). In preparation. of Aspergillus nidulans. Applied and Environmental Microbiology 74(24):7607–7612. 1315 28. Ries LNA, Beattie SR, Espeso EA, Cramer RA, Goldman GH (2016) Diverse Regulation of 54. Yin WB, et al. (2013) A nonribosomal peptide synthetase-derived iron(III) complex from 1254 the CreA Carbon Catabolite Repressor in Aspergillus nidulans. Genetics 203(1):335–352. the pathogenic fungus aspergillus fumigatus. Journal of the American Chemical Society 1316 1255 29. Denison SH (2000) pH regulation of gene expression in fungi. Fungal Genetics and Biology 135(6):2064–2067. 1317 29(2):61–71. 55. Bok JW, et al. (2015) Fungal artificial chromosomes for mining of the fungal secondary 1256 30. Bok JW, Keller NP (2004) LaeA , a Regulator of Secondary Metabolism in Aspergillus spp. metabolome. BMC Genomics 16(1):1–10. 1318 1257 Eukaryotic Cell 3(2):527–535. 56. Guo CJ, et al. (2012) Molecular genetic characterization of a cluster in A. terreus for biosyn- 1319 1258 31. Shroff RA, Lockington RA, Kelly JM (1996) Analysis of mutations in the creA gene involved thesis of the meroterpenoid terretonin. Organic Letters 14(22):5684–5687. 1320 in carbon catabolite repression in Aspergillus nidulans. Can J Microbiol 42(9):950–959. 57. Lo HC, et al. (2012) Two separate gene clusters encode the biosynthetic pathway for the 1259 32. Oakley CE, et al. (2016) Discovery of McrA, a master regulator of Aspergillus secondary meroterpenoids austinol and dehydroaustinol in Aspergillus nidulans. Journal of the Ameri- 1321 1260 metabolism. Molecular microbiology 103(November 2016):347–365. can Chemical Society 134(10):4709–4720. 1322 1261 33. Christensen U, et al. (2011) Unique regulatory mechanism for D-galactose utilization in As- 58. Matsuda Y, Awakawa T, Abe I (2013) Reconstituted biosynthesis of fungal meroterpenoid 1323 pergillus nidulans. Applied and Environmental Microbiology 77(19):7084–7087. andrastin A. Tetrahedron 69(38):8199–8204. 1262 34. Futagami T, et al. (2011) Putative stress sensors WscA and WscB are involved in hypo- 59. Valiante V, et al. (2017) Discovery of an Extended Austinoid Biosynthetic Pathway in As- 1324 1263 osmotic and acidic pH stress tolerance in aspergillus nidulans. Eukaryotic Cell 10(11):1504– pergillus calidoustus. ACS Chemical Biology 12(5):1227–1234. 1325 1264 1515. 60. Manniche S, Sprogøe K, Dalsgaard PW, Christophersen C, Larsen TO (2004) Karnatakafu- 1326 35. Oakley CE, et al. (2017) Discovery of McrA, a master regulator of Aspergillus sec- rans A and B: Two dibenzofurans isolated from the fungus aspergillus karnatakaensis. Jour- 1265 ondary metabolism. Molecular Microbiology 103(2):347–365. nal of Natural Products 67(12):2111–2112. 1327 1266 36. Harris SD, Momany M (2004) Polarity in filamentous fungi: moving beyond the yeast 61. Khaldi N, Collemare J, Lebrun MH, Wolfe KH (2008) Evidence for horizontal transfer of a 1328 paradigm. Fungal genetics and biology : FG & B 41(4):391–400. secondary metabolite gene cluster between fungi. Genome biology 9(1):R18. 1267 37. Riquelme M (2013) Tip Growth in Filamentous Fungi: A Road Trip to the Apex. Annual 62. Lawrence DP, Kroken S, Pryor BM, Arnold AE (2011) Interkingdom gene transfer of a hybrid 1329 1268 Review of Microbiology 67(1):587–609. NPS/PKS from bacteria to filamentous Ascomycota. PloS one 6(11):e28231. 1330 1269 38. Si H, Rittenour WR, Harris SD (2016) Roles of Aspergillus nidulans Cdc42/Rho GTPase 63. Khaldi N, et al. (2010) SMURF: Genomic mapping of fungal secondary metabolite clusters. 1331 regulators in hyphal morphogenesis and development. Mycologia 108(3):543–555. Fungal genetics and biology : FG & B 47(9):736–41. 1270 39. Harris SD, Hamer L, Sharpless KE, Hamer JE (1997) The Aspergillus nidulans sepA gene 64. Camacho C, et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics 1332 1271 encodes an FH1/2 protein involved in cytokinesis and the maintenance of cellular polarity. 10(1):421. 1333 1272 The EMBO journal 16(12):3474–83. 65. Pons P, Latapy M (2005) Computing communities in large networks using random walks. 1334 40. Doshi P, Bossie CA, Doonan JH, May GS, Morris NR (1991) Two alpha-tubulin genes Physics and Society p. arXiv:physics/0512106. 1273 of Aspergillus nidulans encode divergent proteins. Molecular & general genetics : MGG 66. Cock PJ, et al. (2009) Biopython: Freely available Python tools for computational molecular 1335 1274 225(1):129–41. biology and bioinformatics. Bioinformatics 25(11):1422–1423. 1336 41. Kirk KE, Morris NR (1991) The tubB alpha-tubulin gene is essential for sexual development 67. Sievers F, et al. (2011) Fast, scalable generation of high-quality protein multiple sequence 1275 in Aspergillus nidulans. Genes & development 5(11):2014–23. alignments using Clustal Omega. Molecular systems biology 7(1):539. 1337 1276 42. Pan F, Malmberg RL, Momany M (2007) Analysis of septins across kingdoms reveals orthol- 68. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T (2009) trimAl: A tool for automated align- 1338 1277 ogy and new motifs. BMC evolutionary biology 7:103. ment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972–1973. 1339 43. Hernández-Rodríguez Y, et al. (2014) Distinct septin heteropolymers co-exist during multicel- 69. Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ (2015) IQ-TREE: A fast and effective 1278 lular development in the filamentous fungus Aspergillus nidulans. PloS one 9(3):e92819. stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and 1340 1279 44. Nishihama R, Onishi M, Pringle JR (2011) New insights into the phylogenetic distribution and Evolution 32(1):268–274. 1341 1280 evolutionary origins of the septins. Biological Chemistry 392(8-9):681–7. 70. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS (2017) ModelFinder: 1342 45. De Souza CP, et al. (2013) Functional Analysis of the Aspergillus nidulans Kinome. PLoS Fast Model Selection for Accurate Phylogenetic Estimates. Nature Methods 14(6):587–591. 1281 ONE 8(3). 71. Minh BQ, Nguyen MAT, Von Haeseler A (2013) Ultrafast approximation for phylogenetic boot- 1343 1282 46. de Vries RP, Visser J, Ronald P, de Vries, R., P. (2001) AspergillusDRAFT Enzymes Involved in strap. Molecular Biology and Evolution 30(5):1188–1195. 1344 1283 1345 1284 1346 1285 1347 1286 1348 1287 1349 1288 1350 1289 1351 1290 1352 1291 1353 1292 1354 1293 1355 1294 1356 1295 1357 1296 1358 1297 1359 1298 1360 1299 1361 1300 1362 1301 1363 1302 1364 120 Theobald et al. PNAS | February 1, 2018 | vol. XXX | no. XX | 11 Chapter 6

Conclusions and perspectives

Individual comparisons on gene to gene basis were feasible because there were only few genome sequences available. Now projects like the Aspergillus sequencing project or the 1K fungal genome project provide a vast amount of information — creating a need for new methods. Comparative genomics and secondary metabolite gene networks are capable of processing this infor- mation and provide new insights into fungal evolution and biology. This thesis highlighted how comparative genomics can be used to identify gene dynamics over a large set of organisms and how it can be used to char- acterize de novo sequenced species. Chapter 2 presented how comparative genomics can be used to describe species on the section, clade and isolate level and determine the drivers for fungal speciation. Chapter 3 showed how secondary metabolite gene cluster (SMGC) net- works can be used to characterize SMGCs through many species at once. The method points out related SMGCs for analogous compounds in differ- ent species — a valuable tool to identify new leads for bioactive compounds. This also highlights how SMGCs evolved and how Aspergilli create new SMs. Finally, combining production data of SMs with gene cluster families led to the identification of MlfA, the NRPS producing the anit-cancer enhancing compound malformin. This is an important step to support the elucida- tion of SMGCs for desired compounds. Aspergilli have been investigated for their SMs during many years, offering a vast amount of data on produced compounds. Creating SMGC families of newly sequenced Aspergilli and combining them with the available metabolite data will provide many more SMGC leads for pharmaceutically relevant compounds. Chapter 5 prove that the established pipeline is applicable to other sections and scalable to more genomes. Here, we also show how protein families can be used to characterize newly sequenced species with annotations available from model organisms. The general regulators of secondary metabolism and their paralogs, which have been identified in our analysis, can serve as leads for overexpression

121 studies. The thesis also described how genus wide protein sets can be used to characterize the evolutionary history of SM proteins. Our analysis of PKS- NRPS and NRPS-PKS proteins pointed towards multiple events leading to their evolution — a new insight in hybrid phylogenies. The phylogeny which was established in chapter 4 will serve as a lead for new studies to reengi- neer hybrids. The field of combinatorial biosynthesis, which tries to create chimeras of these enzymes, has been hampered by the lack of leads of hybrids amenable to recombination. Generally, this thesis (and the manuscripts included in it) will serve as a go-to reference for future work on sections Nigri and Nidulantes. Charac- terizing their proteome in protein families enabled us to provide an overview of the distribution of each protein of interest throughout all species. Thus, instead of individual searches by BLAST others might use our dataset and the analyses in order to find a protein of interest. Providing the distribution of proteins throughout all species might provide others with key insights they need. Furthermore, the method and content on diversity and dynamics of SMGCs are transferable to other genera and will enable studies on analo- gous SM pathways in fungi. Imitating SMGC dynamics seen for analogous compounds might pave the way for synthetic biology approaches.

122 Appendix A

Appendix

A.1 Supplementary information: The genomes of Aspergillus section Nigri reveal drivers in fungal speciation

123 Materials and methods

Fungal strains Unless otherwise noted, the species examined were taken from the IBT Culture Collection of Fungi at the Technical University of Denmark (DTU). Strains employed in this study are denoted in SI 1.

Purification of DNA and RNA For all sequences generated for this study (SI 1), spores were defrosted from storage at -80oC and inoculated onto solid CYA medium. Fresh spores were harvested after 7-10 days and suspended in a 0.1% Tween solution. Spores were stored in solution at 5oC for up to three weeks. For generation of biomass, spores were inoculated in sterile liquid CYA medium and cultivated for 5-10 days at 30oC. Fungal mycelium was isolated from the liquid medium by filtration through Miracloth and flash-frozen in liquid nitrogen. DNA isolation was performed using a modified version of the standard phenol extraction [Green, Michael R., and Joseph. Sambrook. Molecular Cloning : Cold Spring Harbor Laboratory Press, 2012] and checked for quality and concentration using a NanoDrop (BioNordika, DK). RNA isolation was performed using the Qiagen RNAEasy Plant Mini Kit according to the manufacturer's instructions.

Biomass for all fungal strains was obtained from shake flasks containing 200 mL of complex media CYA, meaox, or CY20 depending on the strain (see SI 1). Biomass was isolated by filtering through Miracloth (Millipore, 475855-1R), freeze dried, and stored at -80oC. A sample of frozen biomass was subsequently used for RNA purification. First, hyphae were transferred to a 2 mL microtube, together with a 5mm steel bead (QIAGEN), placed in liquid nitrogen, then lysed using the QIAGEN TissueLyser LT at 45 Hz for 50 seconds. Then the QIAGEN RNeasy mini Plus Kit was used to isolate RNA., RLT Plus buffer (with 2-mercaptoethanol) was added to the samples, vortexed, and spun down. The lysate was then used in step 4 in the instructions provided by the manufacturer, and the protocol was followed from this step. For genomic DNA, a protocol based on Fulton et al. [Fulton, TM. “Microprep protocol for extraction of DNA from tomato and other herbaceous plants.” Plant Molecular Biology Reporter 13.3 (1995): 207–209] was used.

Protocol for preparation of fungal DNA The protocol described below has successfully been employed to isolate genome-sequencing grade genomic DNA for more than 200 different Aspergillus species. List of Materials

● D-Sorbitol (Sigma, S1876 – CAS 50-70-4) ● Tris-Base (Sigma 7-9, T1378 – CAS 7786-1) ● 37% HCl (Th. Geyer, 836,1000) ● EDTA (Merck, 324503 – CAS 6381-92-6)

124 ● Sodium chloride (NaCl) (AppliChem A1371,9010 – CAS 7647-14-5] ● Cetyl trimethylammonium bromide (CTAB) (Sigma 52365 – CAS 57-09-0) ● Sarkosyl NL (Sigma, L5777 – CAS 137-16-6) ● Polyvinylpyrrolidone (PVP) (Sigma PVP-40T – CAS 9003-39-8) ● Proteinase K (NEB P8107S) ● Potassium acetate (J. T. Baker 0129910025 – CAS 127-08-2) ● Phenol:Chloroform:Isoamylalcohol (25:24:1) (Sigma P3803) ● Sodium acetate (J. T. Baker 9914011001 – CAS 6131-90-4] ● 96 % Ethanol (vwr chemicals) ● 70 % Ethanol (vwr chemicals) ● Isopropanol (Merck, 109634 – CAS 67-63-0) ● Liquid nitrogen ● Sodium hydroxide (Sigma S5881 – CAS 1310-73-2) ● RNase A (Sigma R-4875 – CAS 9001-99-4)

Preparation of liquid media All solutions were autoclaved. Buffers ● 5M Potassium acetate (pH 7.5) ○ 122.5 g potassium acetate and ddH2O up to 250 mL ○ pH adjusted with acetic acid ● 3M Sodium acetate ○ 81.65 g sodium acetate ○ Add ddH2O up to 200 mL ● 1% PVP ○ 2 g PVP in 200 mL ddH2O ● 5% Sarkosyl ○ 10 g Sarkosyl in 200 mL ddH2O ● 1M Tris-HCl (pH 9) ○ 60.57 g Tris-base and 4.81 mL 37% HCl ○ Add ddH2O up to 500 mL ● 0.5M EDTA ○ 116.4 g EDTA ○ Add ddH2O up to 500 mL ○ Add sodium hydroxide pellets until pH reaches 8.0 ● Buffer A ○ 31.9 g sorbitol, 50 mL 1M Tris-HCl (pH 9), 5 mL 0.5M EDTA (pH 8) ○ Add ddH2O up to 500 mL ● Buffer B ○ 100 mL 1M Tris-HCl (pH 9), 50 mL 0.5M EDTA, 58.44 g NaCl, 10 g CTAB ○ Add ddH2O up to 500 mL ● TE (pH 9) ○ 1.21 g Tris-base, 0.37 g EDTA ○ Add ddH2O up to 1000 mL ● Lysis Buffer (for 10 mL pr. Sample) ○ 3.75 mL Buffer A ○ 3.75 mL Buffer B

125 ○ 1.5 mL 5 % Sarkosyl ○ 1 mL 1 % PVP ○ 100 μl Proteinase K ● RNase A ○ Dissolve 10 mg dry powder in 1 mL ddH2O

Equipment ● NanoDrop ND 1000 Spectrophotometer or NanoDrop Lite From Qiagen ● Qubit 1.0 fluorometer from Invitrogen and Qubit dsDNA BR Assay Kit (Q32853) from ThermoFisher. ● Mortar and pestle ● Centrifuge for 50 mL Falcon tubes at 4oC

Protocol 1. Pre-heat Buffer B to 65oC. 2. Prepare Lysis Buffer just before use and keep at 65oC. 3. Transfer freeze-dried mycelia into a mortar and cover with liquid nitrogen. a. Grind material and transfer to a 50 mL Falcon tube as soon as all liquid nitrogen has evaporated. b. Powder in the tube should not exceed the 5 mL mark, but a minimum of 3 mL is recommended. c. Note powder must not thaw. 4. Add 10 mL Lysis Buffer and mix vigorously by vortexing. 5. Incubate for 30 minutes at 65oC. a. Mix frequently by inverting the tube. 6. Add 3.35 mL 5 M potassium acetate. a. Mix gently by inverting the tube 5-7 times. b. Incubate solution 30 minutes on ice. 7. Centrifuge for 30 minutes at 5,000 g at 4oC. 8. Transfer the supernatant (~ 9 mL) to a new 50 mL Falcon tube and add 5 mL of phenol:chloroform:isoamylalcohol (25:24:1) a. Mix gently 5-7 times. 9. Centrifuge 20 minutes at 4,000 g at 4 oC. 10. Transfer the aqueous phase (~8 mL) to a new 50 mL Falcon tube. a. Note avoid any material from the interphase. 11. Add 100 μl RNase A (10 mg/mL) and mix gently. a. Incubate at room temperature for 30-60 minutes 12. Add 1/10 volume of 3M sodium acetate and 1 volume ice-cold 96% ethanol. (Alternatively, isopropanol can be used, but it may adversely influence A260/A280 measurements). 13. Incubate solution at 20oC for 30 minutes. 14. Centrifuge for 30 minutes at 10,000 g and 4oC. 15. Discard the supernatant. 16. Wash the pellet with 2 mL 70 % ethanol and pipette as much away without disturbing the pellet. 17. Dry the pellet at room temperature until all ethanol has evaporated (approximately 15 minutes).

126 18. Note: do not let the pellet dry out! 19. Dissolve the pellet in 500 μL TE. This may take an over-night incubation at room temperature with light shaking. Transfer DNA solution to a 2 mL Eppendorf tube. 20. Take an aliquot for DNA quality assessments (see below) and store the remaining DNA solution at -20 oC until further use. 21. For testing DNA quality: 22. Make a 20-fold dilution of the DNA solution (from step 19) in a 1.5 mL Eppendorf tube to a total volume of 100 μl. 23. Run 5-10 μl of the diluted sample on an agarose gel to estimate the quality and concentration. 24. Use the nanodrop for A260/A280 measurements. Ratios should be in the range of 1.6-2.2. 25. Use the Qubit to determine DNA concentration estimations. Good solutions fall in the range of 20-200 ng/μl DNA in stock solution.

DNA and RNA sequencing and assembly All genomes in this study, except for A. heteromorphus, A. eucalypticola, and A. sclerotioniger, and all transcriptomes were sequenced with Illumina. The genomes of A. heteromorphus, A. eucalypticola, and A. sclerotioniger were sequenced with PacBio.

For all genomic Illumina libraries, 100 ng of DNA was sheared to 270 bp fragments using the Covaris LE220 (Covaris) and size selected using SPRI beads (Beckman Coulter). The fragments were treated with end-repair, A- tailing, and ligated to Illumina compatible adapters (IDT, Inc) using the KAPA-Illumina library creation kit (KAPA biosystems).

For transcriptomes, stranded cDNA libraries were generated using the Illumina Truseq Stranded RNA LT kits. mRNA was purified from 1 µg of total RNA using magnetic beads containing poly-T oligos. mRNA was fragmented using divalent cations and high temperature. The fragmented RNA was reverse transcribed using random hexamers and SSII (Invitrogen) followed by second strand synthesis. The fragmented cDNA was treated with end-pair, A-tailing, adapter ligation, and 10 cycles of PCR.

The prepared libraries were quantified using KAPA Biosystem’s next-generation sequencing library qPCR kit and run on a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then multiplexed with other libraries, and library pools were prepared for sequencing on the Illumina HiSeq sequencing platform utilizing a TruSeq paired-end cluster kit, v3, and Illumina’s cBot instrument to generate clustered flowcells for sequencing. Sequencing of the flowcells was performed on the Illumina HiSeq2000 sequencer using a TruSeq SBS sequencing kit, v3, following a 2x150 indexed run recipe.

After sequencing, the genomic fastq files were QC filtered to remove artifacts/process contamination and assembled using Velvet (Zerbino DR, Birney E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-829.). The resulting assemblies were used to create in silico long mate-pair libraries with inserts of 3000 +/-

127 90 bp, that were then assembled with the target fastq using AllPathsLG release version R47710, (Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 25;108(4).). Illumina transcriptome reads were assembled into consensus sequences using Rnnotator v. 3.3.2 (Martin J, Bruno VM, Fang Z, Meng X, Blow M, Zhang T, Sherlock G, Snyder M, Wang Z. (2010) Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics 2010 11(1), 663.).

For the genomes of A. heteromorphus, A. eucalypticola, and A. sclerotioniger, amplified libraries were generated using a modified shearing version of Pacific Biosciences standard template preparation protocol. 5 μg of gDNA was used to generate each library. The DNA was sheared using a Covaris LE220 focused-ultrasonicator with their Red miniTUBES to generate fragments of 5 kb in length. The sheared DNA fragments were then prepared according to the Pacific Biosciences protocol using their SMRTbell template preparation kit, where the fragments were treated with DNA damage repair (ends repaired so that they were blunt-ended and 5’ phosphorylated). Pacific Biosciences hairpin adapters were then ligated to the fragments to create the SMRTbell template for sequencing. The SMRTbell templates were then purified using exonuclease treatments and size-selected using AMPure PB beads.

Sequencing primer was then annealed to the SMRTbell templates and Version P4 sequencing polymerase was bound to them. The prepared SMRTbell template libraries were sequenced on a Pacific Biosciences RSII sequencer using Version C2 chemistry and 2 hour sequencing movie run times. The three PacBio genomes datasets were assembled using HGAP3 (http://files.pacb.com/software/smrtanalysis/2.2.0/doc/smrtportal/help/! SSL!/Webhelp/CS_Prot_RS_HGAP_Assembly3.htm).

All genomes were annotated using the JGI annotation Pipeline (Grigoriev IV, Nikitin R, Haridas S, Kuo A, Ohm R, Otillar R, Riley R, Salamov A, Zhao X, Korzeniewski F, Smirnova T, Nordberg H, Dubchak I, Shabalov I. (2014) MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic Acids Res. 42(1):D699-704.). Genome assembly and annotations are available at the JGI fungal genome portal MycoCosm (Grigoriev IV, Nikitin R, Haridas S, Kuo A, Ohm R, Otillar R, Riley R, Salamov A, Zhao X, Korzeniewski F, Smirnova T, Nordberg H, Dubchak I, Shabalov I. (2014) MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic Acids Res. 42(1):D699-704.) (http://jgi.doe.gov/fungi) and have been deposited at DDBJ/EMBL/GenBank under the following accessions (TO BE PROVIDED UPON PUBLICATION).

Cultivation for secondary metabolite analysis Fungal strains were cultivated as 3-point cultures on Czapek yeast extract agar (CYA), Czapek yeast extract agar with 5% NaCl (CYAS) [Nielsen, Kristian Fog et al. “Review of Secondary Metabolites and Mycotoxins from the Aspergillus niger Group.” Analytical and Bioanalytical

128 Chemistry 395.5 (2009): 1225–1242], yeast extract sucrose agar (YES) media for 7 days in the dark at 25˚C. Three 6 mm ID plugs taken across of the cultures were then extracted using an (3:2:1) (ethyl acetate/dichloromethane/methanol) mixture and dissolved in methanol (Smedsgaard, J. “Micro-Scale Extraction Procedure for Standardized Screening of Fungal Metabolite Production in Cultures.” Journal of Chromatography a 760.2 (1997): 264–270.).

Chemical analysis of secondary metabolites All chemical analyses were done by reversed phase ultra high performance liquid chromatography (UHPLC) coupled to UV-Vis diode array detection (DAD) combined with either fluorescence detection (FLD) or high resolution mass spectrometry (HRMS). Three different methods were used: Method 1 UHPLC-DAD Pure UHPLC-DAD-FLD was performed using a Dionex RSLC Ultimate 3000 (Dionex, Sunnyvale, CA) system linked to an Agilent 1100 FLD detector (Agilent Technologies, Santa Clara, CA). The system was equipped with an Agilent Poroshell Phenyl-hexyl column (150 × 2.1mm, 2.6 mm), and run using a linear gradient of water–acetonitrile starting from 10% to 100% (both containing 50 ppm trifluoroacetic acid) over 8 minutes, then 100% acetonitrile for 2 minutes. The column temperature was 60˚C, the flow rate 0.8 mL/min, and the injection volume was 1 µL. The UV spectra 200-640 nm were matched against our internal database [Nielsen, Kristian Fog et al. “Review of Secondary Metabolites and Mycotoxins from the Aspergillus niger Group.” Analytical and Bioanalytical Chemistry 395.5 (2009): 1225–1242]. Method 2 UHPLC-DAD-HRMS was conducted on a Dionex RSLC Ultimate system linked to maXis HD QTOF MS (Bruker Daltonics, Bremen Germany). Separation was done on a Kinetex C18 column (100 × 2.1mm, 2.6 mm), with a linear gradient consisting of water and acetonitrile (both buffered with 20 mM formic acid), starting at 10% acetonitrile and increased to 100% in 10 minutes where it was held for 2 minutes, returned (0.4 mL/min, 40°C). Injection volume, depending on sample concentration, typically varied between 0.1-1 µL. Samples were analyzed in ESI+ and some also in ESI- full scan mode scanning m/z 100-1250. Data were analyzed by aggressive dereplication [Klitgaard, Andreas et al. “Aggressive Dereplication Using UHPLC– DAD–QTOF: Screening Extracts for up to 3000 Fungal Secondary Metabolites.” Analytical and Bioanalytical Chemistry 406.7 (2014): 1933–1943] using lists of compounds considered to be from black Aspergillus only (~350), a lists with all Aspergillus compounds (~2450), as well as a list of 1600 reference standards, of which 500 are known to come from Aspergillus. Unknown peaks were matched against Antibase2012 and dereplicated using accurate mass, isotope patterns, adduct patterns, logD, and UV/Vis data (Klitgaard, A. “Aggressive Dereplication Using UHPLC-DAD-QTOF: Screening Extracts for up to 3000 Fungal Secondary Metabolites.” (2016)).

Method 3 UHPLC-DAD-HRMS was conducted on an Agilent Infinity 1290 UHPLC system coupled to an Agilent 6550 QTOF MS. Separation was obtained on an Agilent Poroshell 120 phenyl-hexyl

129 column (2.1 x 250 mm, 2.7 µm) using a linear gradient of water and acetonitrile (both buffered with 20 mM formic acid), from 10% to 100% acetonitrile in 15 minutes, where it was held for 2 min. The flow was 0.35 mL/min and temperature 60°C. Injection volume was between 0.1-1 µL depending on the sample concentration.

Samples were analyzed in ESI+ and some also in ESI-full scan mode scanning m/z 100-1700 and with automatic MS/MS enabled for ion counts above 100 000 counts and with a quarantine time of 0.06 minutes. MS/MS spectra were obtained at 10, 20 and 40 eV [Kildgaard, Sara et al. “Article Accurate Dereplication of Bioactive Secondary Metabolites from Marine-Derived Fungi by UHPLC-DAD-QTOFMS and a MS/HRMS Library.” (2015)].

Full scan data were analyzed as above in Masshunter [Kildgaard, Sara et al. “Accurate Dereplication of Bioactive Secondary Metabolites from Marine-Derived Fungi by UHPLC-DAD- QTOFMS and a MS/HRMS Library.” Marine Drugs 12.6 (2014): 3681–3705]. MS/MS data were matched to our internal MS library (~1700 compounds) of reference standards and tentatively identified compounds [Kildgaard, Sara et al. “Accurate Dereplication of Bioactive Secondary Metabolites from Marine-Derived Fungi by UHPLC-DAD-QTOFMS and a MS/HRMS Library.” Marine Drugs 12.6 (2014): 3681–3705]. Genome annotations

Genome annotation All genomes were annotated based on the JGI annotation pipeline [Grigoriev, Igor V., Diego A. Martinez, and Asaf A. Salamov. “Fungal Genomic Annotation.” Applied Mycology and Biotechnology 6.C (2006): 123–142] as previously described [Kis-Papo, Tamar et al. “Genomic Adaptations of the Halophilic Dead Sea Filamentous Fungus Eurotium Rubrum.” Nature Communications 5 (2014): 3745].

Whole-genome phylogeny

Protein sequences of all organisms were compared using BLASTp (e-value cutoff 1e-05). Orthologous groups of sequences were constructed based on best bidirectional hits (BBH). Two hundred groups with a member from each species were selected and the sequences of each organism was concatenated into one long protein sequence. Concatenated sequences are aligned using MAFFT (thread 16) and well-aligned regions were extracted using Gblocks (-t = p -b4=5 -b5 = h). Trees were then constructed using multithreaded RAxML, the PROTGAMMAWAG model and 100 bootstrap replicates.

Prediction of Secondary Metabolite Gene Clusters For the prediction of secondary metabolite (SM) clusters, we developed a command-line Python script roughly following the SMURF algorithm [Khaldi, N. et al. SMURF: Genomic mapping of fungal secondary metabolite clusters. Fungal Genet. Biol. 47, 736–41 (2010)]. As input, the program takes genomic coordinates and the annotated PFAM domains of the predicted genes.

130 Based on the multi-domain PFAM composition of identified 'backbone' genes, it can predict seven types of SM clusters: 1) Polyketide synthases (PKSs), 2) PKS-like, 3) nonribosomal peptide-synthetases (NRPSs) 4) NRPS-like, 5) hybrid PKS-NRPS, 6) prenyltransferases (DMATS), and 7) terpene cyclases (TCs). Besides backbone genes, PFAM domains, which are enriched in experimentally identified SM clusters (SM-specific PFAMs) were used in determining the borders of gene clusters. The maximum allowed size of intergenic regions in a cluster was set to 3 kb and each predicted cluster was allowed to have up to 6 genes without SM-specific domains.

Prediction of Secreted Proteases Secretome prediction was done using an in-house adaptation of SignalP [Bendtsen, JD et al. “Improved Prediction of Signal Peptides: SignalP 3.0.” Journal of Molecular Biology 340.4 (2004): 783–795].

Identification of Fungal Metabolites Fungal metabolite extracts were prepared using one of the three following methods (Nielsen, Kristian Fog et al. “Review of Secondary Metabolites and Mycotoxins from the Aspergillus niger Group.” Analytical and Bioanalytical Chemistry 395.5 (2009): 1225–1242): ● Chloroform-methanol-acetone-ethylacetate extraction ● Micro-extraction using methanol-dichloromethane-ethyl acetate ● 75% methanol

Metabolites were analyzed by LC-DAD, LC-DAD-TOFMS, or LC-MS. Data was analyzed based on LC retention times, UV spectra or MS data and compared to an in-house adaptation of Antibase [Klitgaard, Andreas et al. “Aggressive Dereplication Using UHPLC–DAD–QTOF: Screening Extracts for up to 3000 Fungal Secondary Metabolites.” Analytical and Bioanalytical Chemistry 406.7 (2014): 1933–1943].

Gene-Compound Assignment Identification of conserved or highly similar fungal gene clusters was performed based on the gene cluster predictions above. The genomes were compared using the BLASTp function from the BLAST+ suite [Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).]. Presence/absence of an orthologous gene to a member in a gene cluster was based on a bi-directional best hit, with e<1e-100 and coverage of >90%. Presence/absence of a full gene cluster was based on the occurrence of the majority of the predicted members in a gene cluster, including the backbone synthetase in another species.

Detection of encoded CAZymes Each Aspergillus protein model was compared using BLASTp to proteins listed in the Carbohydrate-Active Enzymes database (www.cazy.org, [Lombard, Vincent et al. “The Carbohydrate-Active Enzymes Database (CAZy) in 2013.” Nucleic Acids Research 42.1 (2014): D490–D495]). Models with over 50% identity over the entire length of an entry in CAZy were directly assigned to the same family (or subfamily when relevant). Proteins with less than 50%

131 identity to a protein in CAZy were all manually inspected and conserved features, such as the catalytic residues, were searched whenever known. Because an identical 30% sequence identity results in widely different e-values (from non-significant to highly significant), for CAZy family assignments, we examine sequence conservation (percentage identity over CAZy domain length). Sequence alignments with isolated functional domains were performed in the case of multimodular CAZymes. The same methods were used for Penicillium chrysogenum and Neurospora crassa..

Genetic diversity

Mapping of genes shared by groups of species All predicted sets of protein sequences for the 38 genomes analyzed were aligned using the BLASTp function from the BLAST+ suite version 2.2.27 (e-value <= 10^-10 cutoff). These 1.444 whole-genome BLAST tables were analyzed to identify bidirectional hits in all pairwise comparisons. Using custom Python-scripts, homologs were identified within and across the genomes and grouped into sequence similar families using single linkage, if they met? the following criterion: \The sum of the alignment coverage between the pairwise sequences > 130%, the alignment identity between the pairwise sequences > 50%, and the hit must be found in both of the species BLAST output (reciprocal hits). Singletons were assigned a family having only one gene member. This allowed for identification of species unique genes, genes shared by sections, clades and sub-clades of species. All homologs were assigned functional and structural domains using InterPro version 48 [Jones, Philip et al. “InterProScan 5: Genome- Scale Protein Function Classification.” Bioinformatics 30.9 (2014): 1236–1240.] and checked for annotation and sequencing errors by investigating scaffold location and sequence identity.

For the analysis of the pan and core genomes of a subset of 38 fungal species used in this study, the orthologous and paralogous families were subsetted to include only the species of interest. Therefore the genes representing the core and unique portion of the genomes will adjust relative to the accompanying species.

Secondary metabolism Our implementation of SMURF [Khaldi, N. et al. SMURF: Genomic mapping of fungal secondary metabolite clusters. Fungal Genet. Biol. 47, 736–41 (2010)] was run on genomic data from 37 Aspergillus strains. Proteins of the resulting secondary metabolism gene clusters (SMGCs) were compared to each other by alignment using BLASTp (BLAST+ suite version 2.2.27, e- value <= 10^-10). Subsequently, a score based on BLASTp identity and shared proteins was created to determine the similarity between gene clusters as depicted in the formula below. Using these scores, we created a weighted network of SMGC clusters and used a random walk community detection algorithm (R version 3.3.2, igraph_1.0.1, [Pons, Pascal, and Matthieu Latapy. “Computing Communities in Large Networks Using Random Walks .” (2012)], cutoff 1 step) to determine families of SMGC clusters. Finally, we ran another round of random walk clustering on the communities which contained more members than species in the analysis.

132 ptailoring∗ntailoring pbackbone∗nbackbone ∗0.35+ ∗0.65 ttailoring tbackbone

∑ ( pident tailoring) ∑( piden tbackbones) ∗0.35+ ∗0.65 n xtailoring nma xbackbones To create a cluster similarity score, a combined score of tailoring and backbone enzymes was created. The sum of the BLASTp percent identity (pident) of all hits for tailoring enzymes between two clusters was divided by the maximum amount of tailoring enzymes (nmax) and multiplied by 0.35. Then the score for the backbone enzymes was calculated in the same manner but multiplied by 0.65 to give more weight to the backbone enzymes. The scores were added to create an overall cluster similarity score.

avg(piden ttailoring )∗0.35+avg( pident backbones)∗0.65

Identification of shared SMGC families at nodes of the phylogenetic tree A list containing organisms of each branch of the phylogenetic tree was created and compared to the list of organisms for each SMGC family. If all organisms of a family matched, the count on the corresponding node was increased by one.

Prediction of the Aurasperone B gene cluster

Lists of organisms for all SMGC families were compared to the lists of aurasperone B producing species and filtered for Interpro annotations containing the terms “cytochrome P450” or “methyltransferase”.

Primary metabolism Copy numbers were assessed using the homologous protein families generated during the analysis of genome diversity. The gene pathway associations were taken from the A. niger genome-scale model [Andersen, Mikael Rørdam, Michael Lynge Nielsen, and Jens Nielsen. “Metabolic Model Integration of the Bibliome, Genome, Metabolome and Reactome of Aspergillus niger.” Molecular Systems Biology 4 (2008): 1–13]. All proteins in the respective protein families were considered putative isozymes and were included in the copy number analyses.

Comparing the putative isoenzymes in the different species, gene sequences were aligned and clustered using neighbor-joining with MUSCLE v3.8.31 [Edgar, Robert C. “MUSCLE: Multiple Sequence Alignment with Improved Accuracy and Speed.” Proceedings - 2004 Ieee Computational Systems Bioinformatics Conference, Csb 2004 (2004): 728–729]. Resulting trees have been visualized and edited for publication using the python ETE Toolkit [Huerta-Cepas, Jaime, Joaquin Dopazo, and Toni Gabaldon. “ETE: a Python Environment for Tree Exploration.” Bmc Bioinformatics 11.6 (2010): 24]. Subcellular localizations for the genes included in the

133 analysis were predicted using the TargetP [Emanuelsson, O et al. “Predicting Subcellular Localization of Proteins Based on Their N-Terminal Amino Acid Sequence.” Journal of Molecular Biology 300.4 (2000): 1005–1016] webserver at http://www.cbs.dtu.dk/services/TargetP/.

134 SI 3: Phylogenetic relation and InterPro coverage of the Aspergillus genus species. A. Dendrogram of the phylogenetic relation between the 36 species.

The black boxes represent the homologous genes among the species branching from the nodes. The white boxes represent the genes unique to the specific species. B. Genome sizes of each species. The black coloured boxes represent the genes annotated by InterPro. C. Core genome size of each species. The blue coloured boxes represent the genes annotated by InterPro. D Species unique genes. The yellow coloured boxes represent the genes annotated by InterPro. The grey coloured boxes represent genes with no annotation. The numbers at right side of the boxes indicates the total number of annotated and not annotated genes. The bar scales are unique to each graph.

A. B. C. D. Tree Gene count Core genome Species unique genes

1161 A. luchuensis CBS 106.47 13,530 4,610 1,161 101 192 75 A. luchuensis IFO 4308 11,475 4,579 192 8 701 A. piperis 12,069 4,590 701 880 A. eucalypticola 11,934 4,545 880 49 470 A. costaricaensis 11,966 4,601 470 62 782 272 39 A. tubingensis 12,322 4,599 782 655 A. neoniger 11,939 4,556 655 716 A. vadensis 12,132 4,584 716 1555 A. niger CBS 513.88 14,079 4,605 1,555 164 315 823 398 A. niger ATCC 13496 12,197 4,599 315 355 70 A. lacticoffeatus 13,083 4,612 355 301 A. niger ATCC 1015 11,910 4,620 301 115 95 3806 181 A. niger NRRL3 11,846 4,661 181 345 423 A. phoenicis 11,977 4,621 423 1547 A. welwitschiae 13,687 4,627 1,547 2280 1545 A. brasiliensis 13,000 4,656 1,545 1493 A. sclerotioniger 12,338 4,566 1,493 209 1093 A. carbonarius 11,624 4,792 1,093 3471 699 1511 A. sclerotiicarbonarius 12,570 4,628 1,511 219 1262 A. ibericus 11,680 4,512 1,262 2963 A. ellipticus 12,884 4,539 2,963 330 1944 A. heteromorphus 11,131 4,351 1,944 695 A. japonicus 12,023 4,542 695 626 767 171 A. violaceofuscus 12,082 4,544 767 4554 132 1160 A. indologenus 12,161 4,520 1,160 1131 A. uvarum 12,016 4,535 1,131 1752 938 A. brunneoviolaceus 12,074 4,545 938 331 788 279 A. fijiensis 12,018 4,541 788 815 624 89 905 A. aculeatinus 12,027 4,553 905 446 A. aculeatus 10,828 4,518 446 2407 1368 A. saccharolyticus 10,066 4,327 1,368 4851 1814 A. homomorphus 11,361 4,440 1,814 2705 A. nidulans 10,673 4,330 2,705 164081 1808 A. oryzae 12,030 4,657 1,808 3043 1971 A. flavus 12,604 4,680 1,971 1883 A. fumigatus Af293 9,780 4,296 1,883

0 3 6 9 0 5000 10000 15000 0 1000 2000 3000 4000 5000 0 1000 2000 3000

135 SI 5: Phylogenetic relation and InterPro coverage of the fungal species included in this study. A. Dendrogram of the phylogenetic relation between the 38 species. The black boxes represent the homologous genes among the species branching from the nodes. The white boxes represent the genes unique to the specific species. B. Genome sizes of each species. The black coloured boxes represent the genes annotated by InterPro. C. Core genome size of each species. The blue coloured boxes represent the genes annotated by InterPro. D Species unique genes. The yellow coloured boxes represent the genes annotated by InterPro. The grey coloured boxes represent genes with no annotation. The numbers at right side of the boxes indicates the total number of annotated and not annotated genes. The bar scales are unique to each graph.

A. B. C. D. Tree Gene count Core genome Species unique genes

1156 A. luchuensis CBS 106.47 101 13,530 2,465 1,156 192 75 A. luchuensis IFO 4308 11,475 2,445 192 8 701 A. piperis 12,069 2,457 701 880 A. eucalypticola 11,934 2,431 880 49 469 A. costaricaensis 62 11,966 2,466 469 782 264 39 A. tubingensis 12,322 2,469 782 655 A. neoniger 11,939 2,442 655 715 A. vadensis 12,132 2,446 715 1552 A. niger CBS 513.88 164 14,079 2,475 1,552 315 793 398 A. niger ATCC 13496 12,197 2,466 315 354 A. lacticoffeatus 65 13,083 2,471 354 301 A. niger ATCC 1015 95 11,910 2,472 301 108 3709 181 A. niger NRRL3 11,846 2,499 181 338 422 A. phoenicis 11,977 2,481 422 1544 A. welwitschiae 13,687 2,478 1,544 2097 1534 A. brasiliensis 13,000 2,501 1,534 1492 A. sclerotioniger 209 12,338 2,439 1,492 1091 A. carbonarius 11,624 2,587 1,091 3338 683 1508 A. sclerotiicarbonarius 217 12,570 2,489 1,508 1259 A. ibericus 11,680 2,412 1,259 2958 A. ellipticus 321 12,884 2,445 2,958 1940 A. heteromorphus 11,131 2,328 1,940 695 A. japonicus 624 12,023 2,444 695 767 171 A. violaceofuscus 3813 12,082 2,440 767 128 1156 A. indologenus 12,161 2,417 1,156 1129 A. uvarum 12,016 2,444 1,129 1728 935 A. brunneoviolaceus 331 12,074 2,440 935 788 279 A. fijiensis 12,018 2,438 788 400 606 81 905 A. aculeatinus 12,027 2,448 905 446 A. aculeatus 2357 10,828 2,430 446 1367 A. saccharolyticus 10,066 2,318 1,367 1025 1811 A. homomorphus 11,361 2,392 1,811 2639 A. nidulans 10,673 2,355 2,639 6676 1793 A. oryzae 2847 12,030 2,522 1,793 1956 A. flavus 70335 12,604 2,527 1,956 1828 A. fumigatus Af293 9,780 2,307 1,828 92909 3327 P. chrysogenum 11,395 2,420 3,327 7060 N. crassa 10,783 2,403 7,060

0 5 10 0 5000 10000 15000 0 1000 2000 0 2000 4000 6000

136 SI 10

A. Pan, core, unique genes and families of the Nigri clade species The total number of the proteins and families of all genes in the Tubingensis clade (pan genome, green), genes shared by all species (core genome, blue), and genes unique to the individual species (species unique genes, orange). The families were predicted using BLASTp alignments with cutoffs specific to the Aspergillus genus and single linkage clustering designed for this project. Niger clade species:A. lacticoffeatus, A. niger ATCC 1015, A. niger ATCC 13496, A. niger CBS 513.88, A. niger NRRL3 and A. phoenicis. B. Pan, core, unique genes and families of the Tubingensis clade species The total number of the proteins and families of all genes in the Tubingensis clade (pan genome, green), genes shared by all species (core genome, blue), and genes unique to the individual species (species unique genes, orange). The families were predicted using BLASTp alignments with cutoffs specific to the Aspergillus genus and single linkage clustering designed for this project. Tubingensis clade species: A. costaricaensis, A. eucalypticola, A. luchuensis CBS 106.47, A. luchuensis IFO 4308, A. neoniger, A. piperis, A. tubingensis and A. vadensis. A. B. 6 Niger clade species 8 Tubingensis clade species

97,367

90,000

75,092 74,986

60,000 59,050

30,000

18,624 16,431

8,548 7,194 8,175 6,874 4,160 4,126 0 Genes Families Genes Families

Pan genome Core genome Species unique genes

137 S11

Phylogenetic relation and InterPro coverage of the Tubingensis clade species. A. Dendrogram of the phylogenetic relation between the 8 species. The black boxes represent the homologous genes among the species branching from the nodes. The white boxes represent the genes unique to the specific species. B. Gene count of each species. The green coloured boxes represent the genes annotated by InterPro. C. Core genome size of each species. The blue coloured boxes represent the genes annotated by InterPro. D. Species unique genes. The orange coloured boxes represent the genes annotated by InterPro. The grey coloured boxes represent genes with no annotation. The numbers at right side of the boxes indicates the total number of annotated and not annotated genes. The bar scales are unique to each graph. A. B. C. D. Tree Gene count Core genome Species unique genes

1543 A. luchuensis CBS 106.47 13,530 9,435 1,543 274 398 190 A. luchuensis IFO 4308 11,475 9,376 398 20 797 A. piperis 12,069 9,385 797 1155 A. eucalypticola 11,934 9,297 1,155 643 628 A. costaricaensis 11,966 9,400 628 122 968 12,322 9,415 968 74986 102 A. tubingensis 791 A. neoniger 11,939 9,307 791 914 A. vadensis 12,132 9,371 914

0 1 2 3 4 5 0 5000 10000 15000 0 2500 5000 7500 10000 0 500 1000 1500

Dendrogram and InterPro coverage of the Niger clade species Phylogenetic relation and InterPro coverage of the Niger clade species. A. Dendrogram of the phylogenetic relation between the 6 species. The black boxes represent the homologous genes among the species branching from the nodes. The white boxes represent the genes unique to the specific species. B. Gene counts of each species. The green coloured boxes represent the genes annotated by InterPro. C. Core genome size of each species. The blue coloured boxes represent the genes annotated by InterPro. D. Species unique genes. The orange coloured boxes represent the genes annotated by InterPro. The grey coloured boxes represent genes with no annotation. The numbers at right side of the boxes indicates the total number of annotated and not annotated genes. The bar scales are unique to each graph. A. B. C. D. Tree Gene count Core genome Species unique genes

1870 A. niger CBS 513.88 14,079 9,809 1,870 269 404 768 A. niger ATCC 13496 12,197 9,820 404 532 A. lacticoffeatus 13,083 9,839 532 816 413 A. niger ATCC 1015 11,910 9,841 413 59050 309 281 A. niger NRRL3 11,846 9,900 281 660 A. phoenicis 11,977 9,841 660

2000 0 1 2 3 4 0 5000 10000 15000 0 2500 5000 7500 10000 0 500 1000 1500

138 SI 14: Sequence comparison of citrate synthase genes

The sequence of the putative citrate synthases identified in the copy number analysis has been compared using neighbor-joining as implemented in MUSCLE. Potential subcellular locations have been predicted using the TargetP method.

Location prediction Mitochondrial

Non-mitochondrial A. ellipticus 450445 A. carbonarius 171837 A. sclerotioniger 517794 A. ibericus 442782 A. sclerotiicarbonarius 355441 A. flavus 31967 A. oryzae 11870 A. ellipticus 404583 A. brasiliensis 68777 A. eucalypticola 455293 vA. luchuensis (kawachii) 17696 A. neoniger 455592 A. piperis 508326 A. luchuensis (acidus) 85419 A. costaricaensis 146117 A. tubingensis 164827 A. vadensis 486936 A. welvitschiae 180882 A. niger (phoenicis) 321674 A. niger (van Tiegheim) 318127 A. lacticoffeatus 337368 A. niger CBS 513.88 162340 A. niger NRRL3 11764 A. niger ATCC1015 1114423 A. sclerotiicarbonarius 347254 A. carbonarius 212258 A. sclerotioniger 538550 A. ibericus 479786 N. crassa 3316 A. heteromorphus 362715 P. chrysogenum 75563 A. nidulans 1753 A. flavus 35979 A. oryzae 6587 A. fumigatus 6324 A. homomorphus 409106 A. violaceofuscus 407321 A. japonicus 390552 A. indicus 397613 A. saccharolyticus 324929 A. aculeatinus 435157 A. bruxellensis 512338 A. fijiensis 456336 A. uvarum 366644 A. aculeatus 55221 A. ellipticus 371302 A. sclerotiicarbonarius 357753 A. ibericus 444266 A. carbonarius 208932 A. sclerotioniger 585932 A. brasiliensis 45597 A. luchuensis (kawachii) 17752 A. luchuensis (acidus) 36560 A. piperis 454687 A. tubingensis 135540 A. neoniger 431091 A. costaricaensis 249775 A. vadensis 422296 A. welvitschiae 144660 A. niger NRRL3 547 A. niger ATCC1015 1141371 A. niger (phoenicis) 297688 A. niger (van Tiegheim) 318975 A. lacticoffeatus 430647 A. niger CBS 513.88 163129 A. eucalypticola 398085 N. crassa 648 N. crassa 646 N. crassa 647 A. saccharolyticus 363296 A. homomorphus 388413 A. uvarum 308825 A. indicus 446148 A. japonicus 385692 A. violaceofuscus 258126 A. aculeatinus 384987 A. aculeatus 45620 A. bruxellensis 395043 A. fijiensis 395063 P. chrysogenum 84306 A. brasiliensis 46350 A. eucalypticola 356109 A. neoniger 422344 A. vadensis 127367 A. piperis 21367 A. luchuensis (kawachii) 20181 A. tubingensis 194625 A. costaricaensis 185315 A. luchuensis (acidus) 136420 A. niger (van Tiegheim) 368666 A. niger CBS 513.88 166715 A. niger NRRL3 3739 A. niger ATCC1015 1143506 A. lacticoffeatus 112163 A. niger (phoenicis) 270762 A. welvitschiae 182795 A. ibericus 403267 A. sclerotiicarbonarius 393699 A. carbonarius 211502 A. sclerotioniger 152105 A. heteromorphus139 381829 A. ellipticus 382223 0.1 A. nidulans 4707 A. fumigatus 7527 A. flavus 36917 A. oryzae 477 SI 17: Comparative growth profiling of Aspergilli from section Nigri and reference fungi. A. acidus CBS 106.47 A. aculeatinus CBS 121060 16872 A. aculeatus ATTC A. brasiliensis CBS 101740 A. carbonarius ITEM 5010 A. costaricaensis CBS 115574 A. ellipticus CBS 707.79 A. f jiensis CBS 313.89 A. homomorphus CBS 101889 A. ibericus CBS 121593 A. indologenus CBS 114.80 A. japonicus CBS 114.51 A. lacticoffeatus ( niger ) CBS 101883 A. neoniger CBS 115656 A. niger 1015 ATTC A. niger 13496 ATTC Tieghum van A. niger CBS 513.88 A. niger NRRL3 (N400) A. piperis CBS 112811 A. saccharolyticus JOP_1030-1 A. sclerotioniger CBS 115572 A. tubingensis CBS 134.48 A. uvarum CBS 121591 A. vadensis CBS 113365 A. violaceofuscus CBS 115571 A. nidulans A4 FGSC A. oryzae RIB40 A. fumigatus Af293 A. f avus NRRL3357 Penicillium chrysogenum Neurospora crassa OR74A

no carbon source

D-glucose

D-fructose

D-galactose

D-mannose

L-ribose

D-xylose

L-arabinose

L-rhamnose

D-galacturonic acid

D-glucuronic acid

cellobiose

maltose

lactose

raff nose

sucrose

arabinogalactan

beechwood xylan

birchwood xylan

arabic gum

guar gum

starch

apple pectin

citrus pectin

inulin

lignin

wheat bran

sugar beet pulp

citrus pulp

soy bean hulls

rice bran

cotton seed pulp

alfalfa meal

casein

cellulose

140 SI 18: Schematic representation of secondary metabolic gene cluster family identi cation. Protein blast comparisons between all gene cluster members are aggregated into one cluster similarity score. From this, a network is created with gene clusters as nodes (dots) and similarity score as edges (arrows). Subsequently, random walk clustering is used to nd communities of nodes inside the network. Green arrows indicate a high probability that nodes will be assigned to one community. Red arrows indicate community borders. Resulting families are containing communities of related SMGC. BLAST comparison Creating network Run random walk SMGC families clustering A

B

E

141 SI 19: Barplot describing SMGC family frequencies. The bars illustrate the presence of a SMGC family in a certain number of organisms. Numbers above bars show the total counts. 221 Family counts

Number of organisms in family

142 SI 20: Overview of predicted SMGC families (B-D) and their location in the phylogeny. A: Cladogram of species used for secondary metabolic gene cluster analysis in this study. Letter code indicates predicted clusters for fumonisin (f), aurasperone B (a) and an example gene cluster family (x) predicted exclusive - ly in biseriates. B: Example gene cluster found in five distantly related species (x). C: Predicted gene cluster family (f) containing a PKS in Aspergillus niger CBS 513.8 similar to FUM 1 from Fusarium oxysporum. D: Predicted gene cluster family for aurasperone B (a) including the aurofusarin gene cluster from Fusarium graminearum A. luchuensis CBS 106.47 a,f A A. luchuensis IFO 4308 a,f B A. piperis a,f A. niger ATCC 1015 A. eucalypticola a,f Q J A HNA C E HYBRID NAA I O K N D L M P A. costaricaensis a,f A. neoniger a,f A. niger CBS 513.88 A. tubingensis a,f Q J A HNA C E HYBRID RNA A I O Kcomposite L M S A. vadensis a,f A. niger CBS 513.88 a,f,x A. niger NRRL3 A.niger (lacticoffeatus) a,f,x D J A HNA C E HYBRID NAA I O K N D L M S A. niger ATCC 13496 a,f,x A.niger (phoenicis) a,f,x A. niger ATCC 13496 A. niger ATCC 1015 a,f,x D J A HNA C E HYBRID NAA I O K N D L M P A. niger NRRL3 a,f A. welwitschiae a,f A. phoenicis Q J A HNA C E HYBRID NAA I O K N D L M P A. brasiliensis a,x A. sclerotioniger a,f A. carbonarius a A. lacticoffeatus D J A HNA C E HYBRID NAA I O K N D L M P A. sclerotiicarbonarius a,f,x A. ibericus a,f A. welwitschiae A. ellipticus Q V A HNANA E HYBRID NAA I AC K N D L M Y A. heteromorphus a A. japonicus A. violaceofuscus A. sclerotiicarbonarius A T W C HYBRID A. indologenus A. uvarum A. brasiliensis A. fijiensis I NANA J A HNAC E HYBRID U A. brunneoviolaceus A. aculeatinus A. aculeatus A. homomorphus J A HNAC E HYBRID A. saccharolyticus A. homomorphus x 10 kb 30 kb 50 kb A. oryzae A. flavus 0 kb 20 kb 40 kb A. fumigatus Af293 A. nidulans P. chrysogenum

A. eucalypticola H C A D E I PKS A A. eucalypticola W B PKS D D C A. luchuensis CBS 106.47 H C A D E I PKS A A. luchuensis CBS 106.47 PKS B PKS D C A. heteromorphus PKS A. luchuensis IFO 4308 PKS B PKS D C A. welwitschiae K C A D E M PKS A A. welwitschiae PKS B PKS R D A. neoniger H C A D E I PKS A A. neoniger PKS B PKS D C A. niger ATCC 1015 H C A D E I PKS A A. niger CBS 513.88 A. niger ATCC 1015 PKS B PKS O D N L K H M E I C J H C A D E I PKS A A. niger NRRL3 H C A D E I PKS A A. niger CBS 513.88 PKS B PKS O D N L K H M E I C J A. niger ATCC 13496 H C A D E I PKS A A. niger NRRL3 PKS B PKS O D N L K H M E I C J A. phoenicis H C A D E I PKS A A. niger ATCC 13496 PKS B PKS O D N L K H M E I C J A. piperis H C A D E I PKS A A. phoenicis PKS B PKS O D N L K H M E I C J A. sclerotiicarbonarius K C A D E I PKS A A. lacticoffeatus PKS B PKS S D N L K H M E I C J A. sclerotioniger H C A D E J PKS A A. piperis PKS B C A. tubingensis H C A D E I PKS A

A. sclerotiicarbonarius PKS PKS O D N + L K H M E I C J A. vadensis H C A D E I PKS A A. brasiliensis H C A D E I PKS A A. sclerotioniger PKS P D Q C A. luchuensis IFO 4308 H C A D E J PKS A A. tubingensis PKS B PKS C A. ibericus H C A D E J PKS A A. vadensis PKS B C A. costaricaensis H C A D E I PKS A A. ibericus PKS P B C A. carbonarius H C A D E J PKS A

A. costaricaensis Y X U T V PKS B PKS DNAC A. lacticoffeatus H C A D E I PKS A Fusarium graminearum 10 kb 30 kb Z Y C I K X PKS N W D N 10 kb 0 kb 20 kb 40 kb 0 kb 20 kb

143 SI 21: Overview of SM families detected in 27 different Aspergillus isolates Species Compound families Notoamides Oxalines Okaramins Aurantiamins Atromentins Pyranonigrins Tensidols Fumonisins Malformins Kotanins Ochratoxins Geodins Calbistrins Aculenes Aflavinins Naphtho-gamma-pyrones Yanuthones Funalenones Xanthoascins Terphenyllins Austdiols Candidusins Paspalinins Asperazines acids Secalonic Emodins Nigragillins Corymbiferan lactones Aspernomine Asperflavins Neopyranonigrins Pyrophens Dehydrocarolic acid Sclerotionigerins Decaturins A. brunneoviolaceus X X X X A. lacticoffeatus X X X X X X X A. uvarum X X X X X A. niger NRRL3 X X X X X X X X A. ellipticus X X X X X A. saccharolyticus X X A. tubingensis X X X X X A. aculeatus X X X X A. piperis X X X A. niger CBS 513.88 X X X X X X X X A. neoniger X X X X X X X X A. violaceofuscus X X X X A. brasiliensis X X X X X X A. eucalypticola X X X X A. luchuensis CBS 106.47 X X X X A. sclerotiicarbonarius X X X A. costaricaensis X X X X X A. japonicus X X A. heteromorphus X X X A. aculeatinus X X X X X X X A. ibericus X X X X X A. sclerotioniger X X X X X X A. vadensis X X X X X A. fijiensis X X X X X X X X A. homomorphus X X X X A. niger ATCC 1015 X X X X X X X X A. indologenus X X

144 A.2 Supplementary information: Uncovering bioactive compounds in Aspergillus sec- tion Nigri by genetic dereplication using secondary metabolite gene cluster net- works

145 Supporting Information

Theobald et al. 10.1073/pnas.XXXXXXXXXX

Supporting Information (SI)

Similarity

75

50

25

● Subgroup ● ● ● Biseriates ● ● Flavi ● ● ● Penicillium Patulin● ● Uniseriates ● ● Aculins● ● SMGC family ● 30 ● 93 123 Yanuthone D 261

Fig. S1. Network of cluster similarities with all gene clusters connected to the patulin gene cluster. Similarity is indicated by edge color, clade by node color and annotated family by shape. The network shows four gene cluster families associated with the patulin gene cluster. Annotated gene clusters lead to the conclusion that all gene clusters in the network synthesize products based on 6-methylsalicylic acid.

A ellipticus

A niger NRRL3

A sclerotiicarbonarius A brasiliensis

A brasiliensis

0 kb 20 kb

10 kb

Fig. S2. Gene clusters which have been removed from predicted malformin gene cluster family. The gene clusters shown in this figure were only assigned to the family by greedy prediction or similarity to irrelevant parts of the target malformin gene cluster.

SI Figures. 146 Theobald et al. 10.1073/pnas.XXXXXXXXXX 1 of5 Fig. S3. A akuA ∆ strain verification. The figure shows the PCR verification of the akuA (Aspbr1_0077313) deletion strain. In each lane (1-4) contains expected approximate band length, and sizes are compared to 1 kb ladder (L). Lane 1 shows reaction 1 (D) with primers P15+P16 (Table S2) on the wild-type strain, producing a band of 2.7 kb. After transformation of the pyrG1 strain with akuA gene-targeting DNA construct, the correct integration at the akuA locus was tested by reaction 3 (primers P17+P18), yielding a band of 4.0 kb (lane 2), verifying the integration of the AFLpyrG marker into the akuA locus, thus, this strain is akuA∆::AFLpyrG. Lane 3 and 4 show reaction 1 respectively on akuA∆::AFLpyrG (3.3 kb) and the akuA∆ strain (0.8 kb) created after pop-out recombination of AFLpyrG. The marker excision band (0.8 kb) and a product of approximately 6 kb are noticeable in lane 3, which we ascribe to PCR products formed from a direct repeat annealing during PCR. B mlfA∆ strain verification. The figure shows an example of the PCR verification for one of the mlfA (Aspbr1_34020) deletion strains (BRA30, see Table S1). The agarose gel is divided into three sections, each showing two lanes representing a specific reaction on first the wild-type strain (lanes 1, 3, 5) and secondly the mlfA∆::AFLpyrG (lanes 2, 4, 6). The four L lanes show the 1 kb ladder. The expected approximate band length is added to lanes 2, 4 and 5, whereas in lanes 1, 3 and 6 no band is expected. The first two sections show the outcome from reaction 3 (primers P19+P24) and 4 (primers P22+P23), respectively. Lanes 1 and 3 represent PCR on the wild-type strain where AFLpyrG is not present in the mlfA locus, thus no bands are seen. Reactions on the mlfA∆::AFLpyrG mutant (lanes 2+4) show the expected products of 4.1 kb and 3.1 kb, respectively, verifying correct targeted integration. The third and last section with lanes 5 and 6 revealed by reaction 2 (primers P20+P21) that mlfA was only present in wild-type strain (lane 5) but not in mlfA∆::AFLpyrG (lane 6), where two faint unspecific smaller products was observed. C Verification of mlfA-Oex strains. Lanes 1-3 and lanes 4-6 respectively show the results by agarose gel electrophoresis after PCR reaction 3 and 4 (primers P27+P28 and P29+P30), with the internal primers binding in the integrated mlfA instead of the pyrG marker. The mlfA-Oex strains gave the expected bands for both reaction 3 (lane 1+2; 3.7 kb) and reaction 4 (lane 4+5; 4.1 kb) confirming targeted integration of the cassette for both ends at the integration site (IS1) locus. Wild-type controls did not show any product (lane 3 and 6). Testing for presence of untransformed wild-type nuclei was accomplished by reaction 5 (primers P25+P26) in lanes 7-9, where no band is seen for the mlfA-Oex strains (lanes 7+8; expected band length 22kb) and a 2.0 kb band for the wild-type control (lane 9), confirming both strains to be homokaryotic mutants in the IS1 locus. D Setup for verification of mutant strains. The PCR strategy employed ensures verification of two aspects. Firstly, the deletion cassettes with the pyrG marker has been correctly integrated in the targeted locus, gene of interest (GOI), and secondly, no untransformed wild-type nuclei are present in the mutant strains. All the mutant strains analyzed were compared to reference gDNA. Specifically, reaction 1 amplifies the target coding sequence from primers placed into the proximal targeting sequences (Up and Down) applied for homologous recombination. The main application of reaction 1 is to show if the strain is heterokaryon (has wild-type nuclei), I, while it also indicates whether the marker has replaced the target, II. Moreover, it is efficient to show whether the pyrG marker is excised after counter-selection on 5-FOA, III. Reaction 2 is used for very large stretches of DNA, I, that are difficult to amplify in reaction 1, as in the case of mlfA. Reactions 3 and 4 validate correct targeting by employing primers binding in unique locations outside the targeting sequences that each form a pair with primers internally in the marker gene, IV. Reaction 5 may likewise validate correct targeting by the one primer binding in unique location outside the targeting sequence, while also determining if the strain is heterokaryon as it provides a band even for wild-type strains since the second primer is located in the targeting sequence instead of the marker gene, V.

SI Tables.

147 2 of5 Theobald et al. 10.1073/pnas.XXXXXXXXXX Table S1. Strains used in this study

Strain ID Genotype Source BRA1 Wild-type (71) BRA6 pyrG1 This study BRA9 pyrG1, akuA∆::AFLpyrG This study BRA10 pyrG1, akuA∆ This study BRA30 pyrG1, akuA∆, mlfA∆::AFLpyrG This study BRA44 pyrG1, akuA∆, mlfA∆::AFLpyrG This study BRA52 pyrG1, akuA∆, mlfA∆ This study BRA67 pyrG1, akuA∆, mlfA∆, IS1::PgpdA-mlfA-TtrpC-DR-AFUMpyrG-DR This study BRA68 pyrG1, akuA∆, mlfA∆, IS1::PgpdA-mlfA-TtrpC-DR-AFUMpyrG-DR This study

Nomenclature for the TSs refers to the numbered species in the table. AFUM: Aspergillus fumigatus, AFL: A. flavus

148 Theobald et al. 10.1073/pnas.XXXXXXXXXX 3 of5 Table S2. List of primers used in this study

Code Name Sequence (5’-3’) P1 gRNA-ABRApyrG-rv AGCTTACUCGTTTCGTCCTCACGGACTCATCA- GGCCTCTCGGTGATGTCTGCTCAAGCG P2 gRNA-ABRApyrG-fwd AGTAAGCUCGTCGCCTCTTGGCAAGAGCATTGG- TTTTAGAGCTAGAAATAGCAAGTTAAA P3 gRNA-ABRAakuA-rv AGCTTACUCGTTTCGTCCTCACGGACTCATCAG- GATGAACGGTGATGTCTGCTCAAGCG P4 gRNA-ABRAakuA-fwd AGTAAGCUCGTCGATGAAGATGAAGACAGTCG- GTTTTAGAGCTAGAAATAGCAAGTTAAA P5 PgpdA-pac-up-fwd GGGTTTAAUGCGTAAGCTCCCTAATTGGC P6 TtrpC-short-pac-dw-rv GGTCTTAAUGAGCCAAGAGCGGATTCCTC P7 ABRAakuA-Dl-Up-FU GGGTTTAAUGACTGGTGAGTGATTGTGGGAG P8 ABRAakuA-Dl-Up-RU GGACTTAAUGTGGGGCTGCTGTGTCTG P9 ABRAakuA-Dl-Dw-FU GGCATTAAUGCATAATAGGATGGTCGGGTTCG P10 ABRAakuA-Dl-Dw-RU GGTCTTAAUGCCTTGGTAGGCAGACGAATAG P11 ABRA34020-Dl-Up-FU GGGTTTAAUGAGGGAGTGAAAGTCGTCG P12 ABRA34020-Dl-Up-RU GGACTTAAUGAAGGAGGTGAAGTTACTAGGAC P13 ABRA34020-Dl-Dw-FU GGCATTAAUGTGGTTGCCATGAGTGAAAGTG P14 ABRA34020-Dl-Dw-RU GGTCTTAAUGATGCAGTGCAGGTTGGAG P15 ABRAakuA-ChkGap-F CCAGCCAGCGTCATCAATTAC P16 ABRAakuA-ChkGap-R CCTACCCCGACATCCAACC P17 ABRAakuA-ChkUP-F CCTTCCCAGCTCTCAAGTCC P18 AFLpyrG-Chk-int-R2 CTAGATCACATGTAAGTGGCATCCC P19 ABRA34020-Chk-Up-F CCGTCCGAAGATCAATCCGAC P20 ABRA34020-Chk-Gap-F GCACGTCGGCTGTGATGG P21 ABRA34020-Chk-Int5’-R GAGTGTGATCTGGATTCGGG P22 ABRA-34020-ChkDw-R CCCACCATTTGTAACGCACTG P23 AFLpyrG-Chk-Dw-F CCCACCACCCCCTACTCTAACAC P24 AFLpyrG-Int-R-BP CCCATCACAAACTTCTTATACTTCCGA P25 ABIS3-ChkDw-R CACCACAGCCATCAAATCC P26 ABIS3-ChkGap-F CGACAACCCCAACAACG P27 ABIS3-Chk-Up-F2 GGTGTCTGTGTGTGACAGACTAG P28 ABRA34020-Chk-Int5’-R GAGTGTGATCTGGATTCGGG P29 ABIS3-Chk-Dw-R2 GTGTCTTTGGTACCATGGGAGAG P30 BRAmalA-GapChk-F GAAGTGTGGGGTTGATGTG P31 ABRA-IS-Up-FU2 GGGTTTAAUGGCTTCTCCATCCTTCTACATG P32 ABRA-IS-Up-RU2 AGGGAAUTGTAGGTAAACCAGCACGCTC P33 ABRA-IS-Dw-FU GGCATTAAUGATGAGCATATTGTGTGAGG P34 ABRA-IS-Dw-RU GGTCTTAAUGGGATGGGTATGTTGGA P35 PgpdA(2.3kb)-FU-ANIS5-TSI ATTCCCUTGTATCTCTACACACAGGC P36 PgpdA-Rev-RU PacI.Reg AAACCCUCAGCGCGGTAGTGATGTCTGCTCAAG P37 TtrpC-PacI.Reg-FU AGGGTTUAATTAAGACCTCAGCGGATCCACTTAACGTTAC P38 TtrpC-RU-PacI GGACTTAAUCGCTTACACAGTACACGAG P39 ABRA34020-F1-FU-PacI 2 GGGTTTAAUATGAGTCGCTTTTCCTGC P40 ABRA34020-F1-RU 2 AGCCCCTGUCCGTGTGTT P41 ABRA34020-F2-FU ACAGGGGCUGTTCAACACG P42 ABRA34020-F2-RU AGCGAGAUGCGGACAGG P43 ABRA34020-F3-FU 2 ATCTCGCUCGGTGACCAAG P44 ABRA34020-F3-RU 2 ACTCCGUCAACTGGAACATCTC P45 ABRA34020-F4-FU 2 ACGGAGUGGAGATGATGGTC P46 ABRA34020-F4-RU-PacI GGTCTTAAUTCAAACACAGACACCCCGAG

P1-6 CRISPR/cas9 construction, P7-14 deletion plasmid construction, P15-30 strain verification. P31-46 Integration site (IS) and mlfA complementation. 149 4 of5 Theobald et al. 10.1073/pnas.XXXXXXXXXX Table S3. List of plasmids used in this study

Plasmid ID Application Content pFC334 (44) CRISPR sgRNA template argB, Ptef1 -cas9 -Ttef1, PgpdA-sgRNA-TtrpC pFC332 (44) CRISPR/cas9 vector hygR, Ptef1 -cas9 -Ttef1 pFC837 CRISPR/cas9 pyrGb hygR, Ptef1 -cas9 -Ttef1, PgpdA-sgRNA(pyrG)-TtrpC pFC715 CRISPR/cas9 akuAc hygR, Ptef1 -cas9 -Ttef1, PgpdA-sgRNA(akuA)-TtrpC pFC478 (44) Gene deletion vector DR-AFLpyrG-DR pFC645 Deletion plasmid akuA UpakuA-DR-AFLpyrG-DR-DownakuA pFC809 Deletion plasmid mlfA UpmlfA-DR-AFLpyrG-DR-DownmlfA pFC3 (43) Base overexpression vector DR-AFUMpyrG-DR pFC1116 mlfA overexpression UpIS1-PgpdA-mlfA-TtrpC-DR-AFUMpyrG-DR-DownIS1

b: Protospacer for targeting A. brasiliensis pyrG is GCCTCTTGGCAAGAGCATTG, c: Protospacer for targeting A. brasiliensis akuA is GATGAAGATGAAGACAGTCG. All plasmids contain the ampR and AMA1 sequence for propagation in E. coli and Aspergilli, respectively.

150 Theobald et al. 10.1073/pnas.XXXXXXXXXX 5 of5 A.3 Supplementary information: Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli

151 Supplementary information for: Genus level analysis of PKS-NRPS and NRPS-PKS hybrids reveals their origin in Aspergilli

January 31, 2018

152

1 A C E Aspka1_1_15585 Asptu1_137648 Aspneo1_404067 Aspfo1_203524 Aspcos1_267997 Aspvad1_384352 Asppip1_499150 Aspscl1_605326 Aspfo1_83090 Aspeuc1_454486 Asppip1_504170 Aspka1_1_21681 Aspcos1_286633 Aspeuc1_444816 Aspneo1_456545 Aspph1_233459 Aspni7_1128344 Asptu1_67116 Aspwel1_177549 Aspca3_505925 Aspvad1_379300 Aspscl1_571530 Aspph1_302357 Aspibe1_469507 Aspsac1_376297 Aspscle1_426029 Aspni7_1112058 Aspfij1_516001 Aspbr1_191248 Aspvio1_417425 Aspibe1_412395 Aspph1_313881 Aspph1_308983 Aspni7_1111323, Pyranonigrin E Aspni7_1122199 Aspwel1_184852 Aspwel1_169928 Aspeuc1_450878 Aspbr1_190761 Aspste1_411931 Aspste1_463238 Asphet1_337005 Aspph1_308144 Aspell1_469246 ●Aspvad1_359092 Aspni7_1087173 72●Aspeuc1_309019 44●Asptu1_34465 Asptu1_197164 44●Aspcos1_287415 Aspbr1_211676 ●Aspneo1_417997 ●Aspph1_298616 Aspind1_474999 ●Aspni7_1115863 Aspvio1_446723 ●Aspwel1_156748 Pench1_7144 ●Aspka1_1_20326 Aspuva1_318160 90 ●Aspfo1_48698 ●Asppip1_439863 Asphet1_343322 ●Aspste1_418130 Aspfu1_9205, Pseurotin A ●Aspka1_1_13552 92 ●Aspfo1_107916 Aspfu_A1163_1_107529 ●Asppip1_347659 Aspcl1_722 ●Pench1_85311 85 ●Aspind1_482416 Aspnov1_434415 ●Aspcam1_323099 Aspcl1_6366, Cytochalasin Aspca3_5640 Aspste1_507248 Aspca3_5570 Aspte1_325, Isoflavipucine Aspscl1_620660 Aspscle1_1697 Aspcam1_260046 Aspste1_429736

1 2 3 1 2 3 2 4 6

B D Aspka1_1_16681 Aspfij1_516563 Aspfo1_192225 Section/ subgroup 83 Aspeuc1_406874 85 Aspbru1_499169 Aspvad1_381647 Candidi Aspcos1_272872 Aspacu1_375544 Circumdati Aspneo1_435336 Aspac1_31185 Asppip1_472204 Clavati Aspph1_236716 Aspind1_455514 Aspni7_1099903 Flavi Aspwel1_181539 Aspvio1_457097 Asptu1_205634 Aspuva1_425877 Fumigati Aspibe1_443386 Aspscle1_422284 94 Asphom1_462388 Nidulantes Aspcl1_7331 Niger_biseriates Asphom1_370420 Aspnov1_454988 Aspibe1_469268 Aspfl1_36768, Cyclopiazonic acid Niger_uniseriates Aspvad1_474678 Aspneo1_445177 Aspsac1_388526 Ochraceorosei Asptu1_133959 85 Penicillium 86 Aspcos1_255250 Aspste1_512973 64 Aspeuc1_453307 Aspor1_2284 Terrei Aspfo1_142878 82 Asppip1_508137 Aspfl1_25756 92 Aspka1_1_21544 Orientation Aspph1_303802 Aspste1_392794 66 Aspni7_1170655 Aspnov1_507550 ● N−type Aspwel1_181897 86 Aspbr1_191763 Aspvio1_473061 Aspor1_11154 P−type Aspfl1_33626 Aspuva1_194565 Aspste1_410869 85 Aspind1_366949 Aspph1_331751 Aspni7_1188722 Aspfij1_421171 Aspwel1_156573 Asphom1_397180 Aspbru1_489726 Aspbr1_126737 Aspcl1_6268 Aspacu1_429294 Aspibe1_400692 Aspac1_44810 Aspscle1_380544 86 Aspcam1_310784 Aspca3_511653 Aspoch1_492959 89 Aspibe1_480497 Aspscl1_621233 Aspscle1_441799 Aspnov1_448608, Chaetoglobosin Aspoch1_484695 Aspfu_A1163_1_103645 Aspnid1_9705, Aspyridone Aspste1_454498 Aspste1_477231 Aspscle1_361763

1 2 3 1 2 3

Figure 1: Selected nodes from hybrid maximum liklihood phylogeny. Subtrees were extracted from hybrid tree figure 1. Sections and species groups indicated by tip color; Orientation of hybrids N-type (NRPS-PKS) and P-type (PKS-NRPS) indicated by tip shape. Tip labels constist of jgi organism identifier, jgi protein id and associated compound (if applicable).

153

2 genus ● Trichophyton equinum CBS 127.97, EGE01982.1 ● Aspergillus ● Trichophyton tonsurans CBS 112818, EGD97139.1 ● Trichophyton interdigitale MR816, KDB22226.1 ● Dothideomycetes ● Trichophyton verrucosum HKI 0517, XP_003020763.1 ● Trichophyton benhamiae CBS 112371, XP_003014124.1 ● Eurotiomycetes ● Nannizzia gypsea CBS 118893, XP_003176907.1 ● Arthroderma otae CBS 113480, XP_002850891.1 ● Exobasidiomycetes ● Metarhizium robertsii ARSEF 23, XP_007824811.1 ● 76● Metarhizium anisopliae, KFG82172.1 Leotiomycetes 90● Metarhizium brunneum ARSEF 3297, XP_014543166.1 ● Orbiliomycetes ● Metarhizium guizhouense ARSEF 977, KID83603.1 ● Metarhizium acridum CQMa 102, XP_007815889.1 ● Penicillium ● A. piperis, Asppip1_347659 ● A. luchuensis IFO 4308, Aspka1_1_13552 ● Planctomycetes ● P. chrysogenum, Pench1_85311 ● A. indologenus, Aspind1_482416 ● Proteobacteria ● Glonium stellatum, OCL14727.1 ● A. neoniger, Aspneo1_417997 ● Sordariomycetes ● 60A. costaricaensis, Aspcos1_287415 ● ● A. vadensis, Aspvad1_359092 Terrabacteria group 52 ● A. tubingensis, Asptu1_34465 ● Undefined 81● A. eucalypticola, Aspeuc1_309019 ● A. welwitschiae, Aspwel1_156748 ● Xylonomycetes ● A. niger ATCC 1015, Aspni7_1115863 ● A. piperis, Asppip1_439863 93 ● A. luchuensis CBS 106.47, Aspfo1_48698 79 ● Hirsutella minnesotensis 3608, KJZ68392.1 sm_short ● A. steynii, Aspste1_418130 78 ● Hirsutella minnesotensis 3608, KJZ72673.1 ● HYBRID ● Ophiocordyceps unilateralis, KOM20203.1 ● Pestalotiopsis fici W106−1, XP_007834561.1 NRPS ● Eutypa lata UCREL1, XP_007798173.1 ● Madurella mycetomatis, KXX74579.1 62 ● Xylona heveae TC161, XP_018192493.1 81 ● Cenococcum geophilum 1.58, OCL00726.1 ● Talaromyces cellulolyticus, GAM40300.1 ● Streptomyces sp. NRRL S−4, KPC78190.1 ● Streptomyces sp., APD71785.1 ● Streptomyces thermolilacinus, WP_023586037.1

3 4 5 6

Figure 2: ML phylogeny of fungal and bacterial hybrids. Subtree extracted from figure 6. Tip labels show species name and NCBI identifier/ JGI organism and protein identifier. Tip color indicates genus or class; Tip shape indicates SM protein type. Bootstrap values under 95 are shown in red. Hybrids from Streptomyces form a sister clade to fungal hybrids NRPS-PKS hybrids, indicating this class from bacterial origin.

154

3 genus ● Trichophyton rubrum CBS 118892, XP_003239193.1● 59● Trichophyton rubrum CBS 735.88, EZG01159.1 Aspergillus ● 63Trichophyton violaceum, OAL75420.1 ● ● Trichophyton benhamiae CBS 112371, DAA76613.1 Dothideomycetes ● Trichophyton equinum CBS 127.97, EGE07077.1 ● 64● Trichophyton interdigitale H6, EZF33994.1 Eurotiomycetes ● Trichophyton tonsurans CBS 112818, EGD93407.1 ● Nannizzia gypsea CBS 118893, XP_003176265.1 ● Exobasidiomycetes ● Arthroderma otae CBS 113480, XP_002847685.1 ● 91 ● A. steynii, Aspste1_463238 Leotiomycetes ● Glarea lozoyensis ATCC 20868, XP_008085943.1Xenolozoyenone 62 ● Pseudogymnoascus sp. 24MN13, OBT53708.1 ● Orbiliomycetes 61 ● Endocarpon pusillum Z07020, XP_007802712.1 ● Byssochlamys spectabilis No. 5, GAD98579.1 ● Penicillium ● Coniosporium apollinis CBS 100218, XP_007781409.1 ● A. welwitschiae, Aspwel1_184852 ● Planctomycetes ● A. phoenicis, Aspph1_308983 ● A. niger ATCC 1015, Aspni7_1111323 Pyranonigrin E ● Proteobacteria ● A. eucalypticola, Aspeuc1_450878 ● A. brasiliensis, Aspbr1_190761 ● Sordariomycetes ● A. piperis, Asppip1_504170 ● A. luchuensis IFO 4308, Aspka1_1_21681 ● Terrabacteria group ● A. tubingensis, Asptu1_137648 82● ● 53A. eucalypticola, Aspeuc1_444816 Undefined 92● A. neoniger, Aspneo1_404067 ● A. vadensis, Aspvad1_384352 ● Xylonomycetes ● A. costaricaensis, Aspcos1_267997 ● A. welwitschiae, Aspwel1_177549 ● A. phoenicis, Aspph1_233459 ● A. sclerotioniger, Aspscl1_571530 sm_short ● A. carbonarius, Aspca3_505925 ● A. ibericus, Aspibe1_469507 ● HYBRID ● A. sclerotiicarbonarius, Aspscle1_426029 ● A. violaceofuscus, Aspvio1_417425 NRPS ● A. fijiensis, Aspfij1_516001 ● A. ibericus, Aspibe1_412395 ● Macrophomina phaseolina MS6, EKG19785.1 ● Cordyceps confragosa, OAR00925.1 ● Cordyceps confragosa RCEF 1005, OAA78753.1 ● Isaria fumosorosea ARSEF 2679, XP_018702131.1 ● Cordyceps militaris CM01, XP_006673220.1 ● Cordyceps militaris, AFK23390.1 ● Torrubiella hemipterigena, CEJ91944.1 ● A. ellipticus, Aspell1_469246

2 3 4 5

Figure 3: ML phylogeny of pyranonigrin associated hybrids. Subtree extracted from figure 6. Tip labels show species name and NCBI identifier/ jgi organism and protein identifier. Tip color indicates genus or class; Tip shape indicates SM protein type. Bootstrap values under 95 are shown in red. Additional tip label shows associated compound (if applicable).

155

4 A.4 Supplementary information: Compara- tive genomics of A. nidulans and section Nidulantes

156 clade ● A. aurantiobrunneus ● A. unguis ● A. varians ● A. terreus 3427 ● clade_I ● A. niger ATCC 1015 1180842 100● ● 100A. aculeatinus 396729 clade_II ● A. fumigatus Af293 6208 99● A. clavatus 456 ● clade_III 98● A. ochraceoroseus 501719 ● ● A. karnatakaensis 99742 clade_IV ● A. amoeneus 201492 100● ● 99A. versicolor 143193 references 99● A. sydowii 50634 ● A. aurantiobrunneus 181502 ● A. spectabilis 176530 ● A. crustosus 367593 ● A. stella−maris 174927 99● A. filifera 166866 10099● A. venezuelensis 373864 ● A. olivicola 205876 ● A. undulatus 260666 96● A. varians 161990 ● A. unguis 141527 ● A. multicolor 156303 ● A. recurvatus 176526 ● A. stercoraria 240960 100● A. desertorum 133362 ● A. fructiculosus 206467 99● A. falconensis 151843 ● A. tetrazonus 178090 ● A. nidulans 1633 aspC ● A. acristatulus 189267 ● A. neoechinulatus 164326 100● A. similis 181597 ● A. indicus 169534 ● A. foveolata 88351 ● A. navahoensis 203597 ● A. olivicola 182929 ● A. filifera 162898 ● A. stella−maris 183931 99● A. venezuelensis 13497 ● A. undulatus 47073 ● A. karnatakaensis 198846 ● A. aurantiobrunneus 220647 ● A. unguis 87081 ● A. multicolor 167436 ● A. spectabilis 170228 ● A. varians 190334 ● A. versicolor 37124 100● A. amoeneus 209217 ● A. crustosus 379277 ● A. neoechinulatus 166926 ● A. foveolata 151470 ● A. nidulans 5274 aspD ● A. falconensis 171425 99● A. fructiculosus 117202 ● A. sydowii 52968 ● A. acristatulus 180309 ● A. recurvatus 164342 ● A. tetrazonus 19741 ● A. similis 184604 ● A. indicus 173605 ● A. stercoraria 173174 99● A. desertorum 137408 ● A. navahoensis 219575 ● A. ochraceoroseus 452725 ● A. aculeatinus 460251 ● A. clavatus 7653 100● A. fumigatus Af293 752 ● A. niger ATCC 1015 1098716 ● A. terreus 2643 ● 100A. amoeneus 91272 100● A. versicolor 141680 ● A. sydowii 34958 ● A. undulatus 267149 ● A. varians 332135 ● A. aurantiobrunneus 218715 96 ● A. venezuelensis 367792 97● A. stella−maris 20389 96● A. filifera 203598 ● A. olivicola 172032 ● 100A. crustosus 387357 100● A. spectabilis 146808 ● A. karnatakaensis 199684 ● A. unguis 166476 ● A. multicolor 185259 ● A. nidulans 4747 aspB ● A. foveolata 136768 ● A. similis 180226 ● A. recurvatus 192447 ● A. navahoensis 238983 ● A. indicus 25862 ● A. neoechinulatus 183107 ● A. tetrazonus 29842 100● A. acristatulus 105037 98● A. fructiculosus 211778 100● A. falconensis 148803 99● A. stercoraria 270251 ● A. desertorum 90807 99● A. ochraceoroseus 504584 100 ● A. aculeatinus 254273 100● A. niger ATCC 1015 1173846 ● A. clavatus 3448 100● A. fumigatus Af293 8940 ● A. terreus 9986 ● A. fructiculosus 174309 100● A. falconensis 142602 99● A. recurvatus 90731 ● A. navahoensis 94341 ● A. stercoraria 83671 100● A. desertorum 74088 ● A. tetrazonus 36917 96● A. nidulans 335 aspE 100● A. foveolata 111843 98100● A. acristatulus 20357 100● A. similis 170347 ● A. indicus 189788 ● A. multicolor 127838 ● 99A. stella−maris 87587 100● A. olivicola 153657 ● A. undulatus 199421 ● A. sydowii 87807 97● A. unguis 141252 ● A. spectabilis 192595 100 ● A. clavatus 5498 99● A. fumigatus Af293 3852 ● A. terreus 2858 ● A. ochraceoroseus 405099 ● A. aurantiobrunneus 175484 ● A. sydowii 59428 ● A. fumigatus Af293 6646 99● A. clavatus 2381 97● A. terreus 9338 ● A. aculeatinus 37896 ● A. niger ATCC 1015 1151441 ● A. ochraceoroseus 472320 ● A. amoeneus 195965 ● A. versicolor 53761 ● A. undulatus 271226 ● A. venezuelensis 234698 ● A. olivicola 194166 ● A. filifera 163854 ● A. varians 315270 ● A. stella−maris 185780 ● A. unguis 133145 ● 100A. crustosus 170832 98● A. spectabilis 162208 ● A. karnatakaensis 127244 ● A. multicolor 187733 ● A. neoechinulatus 152282 ● A. falconensis 187196 ● A. fructiculosus 187766 ● A. navahoensis 46330 ● A. recurvatus 153849 ● A. stercoraria 35276 ● A. desertorum 125285 ● A. indicus 185771 ● A. similis 186811 ● A. acristatulus 127487 ● A. foveolata 130647 ● A. nidulans 495 aspA ● A. tetrazonus 103091 0.9

Fig. S1. Maximum likelihood phylogeny of septin homologues 157 2 of 11 Theobald et al. 10.1073/pnas.XXXXXXXXXX ●A. navahoensis 36627 ●A. falconensis 149548 ●A. indicus 176797 ●A. desertorum 76321 ●A. similis 95390 ●A. stercoraria 242396 ●A. fructiculosus 178764 ●A. recurvatus 160072 ●A. nidulans 1908 nimX ●A. foveolata 135292 ●A. tetrazonus 138293 ●A. neoechinulatus 88240 ●A. acristatulus 64828 ●A. multicolor 8973 ●A. aurantiobrunneus 188833 ●A. unguis 9300 ●A. undulatus 250390 96●A. stella−maris 173388 ●A. filifera 110955 ●A. venezuelensis 392124 ●A. olivicola 227023 ●A. varians 285260 ●A. versicolor 44987 100●A. amoeneus 195073 ●A. sydowii 154295 ●A. karnatakaensis 227527 ●A. spectabilis 18401 ●A. crustosus 374113 ●A. ochraceoroseus 109745 ●A. aculeatinus 340042 ●A. niger ATCC 1015 1127329 ●A. clavatus 8615 ●A. fumigatus Af293 7874 ●A. terreus 9420 100 ●A. versicolor 764564 96●A. amoeneus 235913 ●A. sydowii 149497 ●A. spectabilis 172680 ● 99A. desertorum 89003 99●A. stercoraria 31658 ●A. navahoensis 10384 ●A. recurvatus 172533 ●A. indicus 10732 ●A. nidulans 1739 phoA ●A. tetrazonus 168729 ●A. foveolata 139765 100●A. fructiculosus 180138 100●A. falconensis 151605 ●A. multicolor 186120 ●A. neoechinulatus 119157 ●A. similis 191306 ●A. acristatulus 57251 ●A. varians 315721 ●A. aurantiobrunneus 188236 clade ● 100A. amoeneus 59997 100●A. versicolor 385057 ● A. aurantiobrunneus ●A. sydowii 161723 ● ● A. unguis 98A. venezuelensis 366708 99●A. stella−maris 154813 ●A. filifera 186953 ● A. varians ● 98A. olivicola 225870 ● ●A. undulatus 117649 clade_I 97 ● 100100A. crustosus 390046 ● clade_II 100●A. karnatakaensis 205043 100 ●A. spectabilis 166366 ● clade_III ●A. unguis 164821 ●A. ochraceoroseus 479529 ● clade_IV ●A. fumigatus Af293 6312 9999●A. clavatus 354 ● references 98●A. terreus 2139 99 ●A. niger ATCC 1015 1155842 ●A. aculeatinus 388804 ●A. niger ATCC 1015 36108 ●A. unguis 153320 ●A. ochraceoroseus 470440 ●A. niger ATCC 1015 1187550 ● 99 A. crustosus 397678 100●A. spectabilis 161282 ●A. karnatakaensis 44039 ●A. olivicola 193595 98 ●A. stella−maris 162369 96●A. venezuelensis 370092 98●A. filifera 165706 96●A. undulatus 253937 ●A. aurantiobrunneus 51426 ● 100A. versicolor 68978 100●A. amoeneus 200090 ●A. sydowii 141635 ●A. varians 187431 ●A. multicolor 174155 ●A. nidulans 7560 phoB ●A. foveolata 117680 98●A. acristatulus 194724 ●A. tetrazonus 199340 ●A. neoechinulatus 159574 ●A. stercoraria 25131 100●A. desertorum 129324 100●A. similis 187678 ●A. indicus 5130 ●A. recurvatus 138795 ●A. navahoensis 223813 ●A. falconensis 168352 ●A. fructiculosus 102525 0.9

Fig. S2. Maximum likelihood phylogeny of nimX, phoA, phoB homologues

158 Theobald et al. 10.1073/pnas.XXXXXXXXXX 3 of 11 clade ● A. aurantiobrunneus ● A. unguis ● A. varians ● clade_I ● A. desertorum 53 ● A. falconensis 104137 ● ● A. fructiculosus 22678 clade_II ● A. recurvatus 194631 ● A. acristatulus 61420 ● clade_III ● A. multicolor 67678 ● A. tetrazonus 67366 ● clade_IV ● A. neoechinulatus 90974 ● A. stercoraria 42737 ● references ● A. navahoensis 231773 ● A. foveolata 61343 ● A. similis 38955 ● A. indicus 127712 96● A. aurantiobrunneus 85238 97● A. unguis 127158 ● A. nidulans 3302 modA ● A. stella−maris 74583 100● A. filifera 191231 ● A. venezuelensis 202800 96● A. varians 70675 ● A. undulatus 126529 ● A. olivicola 62220 ● A. versicolor 85914 ● A. amoeneus 122256 ● A. sydowii 58457 ● A. ochraceoroseus 510065 96● A. spectabilis 181987 ● A. karnatakaensis 145928 ● A. crustosus 15570 ● A. clavatus 789 ● A. fumigatus Af293 2196 ● A. terreus 10326 100● A. aculeatinus 194244 99 ● A. niger ATCC 1015 52477 ● A. varians 357983 ● A. stella−maris 66981 ● A. filifera 195919 ● A. venezuelensis 376260 ● A. unguis 19833 ● A. undulatus 202713 ● A. olivicola 199597 ● A. aurantiobrunneus 78549 ● A. varians 120122 ● A. sydowii 822203 100● A. amoeneus 219006 ● A. versicolor 586268 ● A. multicolor 149941 ● A. crustosus 389190 99● 97A. spectabilis 178199 100 ● A. karnatakaensis 157524 ● A. stercoraria 245376 ● A. desertorum 141530 ● A. nidulans 410 racA ● A. similis 103397 ● A. foveolata 169162 ● A. navahoensis 214679 ● A. recurvatus 88865 ● A. tetrazonus 38491 ● A. acristatulus 18333 98● A. indicus 165142 ● A. falconensis 33381 100● A. fructiculosus 168315 ● A. neoechinulatus 152031 ● A. ochraceoroseus 515967 98 ● A. niger ATCC 1015 1147254 ● A. aculeatinus 10173 ● A. terreus 2936 ● A. clavatus 5427 ● A. fumigatus Af293 3778 ● A. nidulans 9493 rho4 ● A. tetrazonus 26867 100● A. neoechinulatus 163561 ● A. foveolata 106345 100● A. similis 21888 ● A. acristatulus 144955 ● A. unguis 88150 ● A. recurvatus 36416 ● A. indicus 18058 ● A. navahoensis 144734 ● A. fructiculosus 65183 ● A. falconensis 27954 ● A. desertorum 40525 99● A. stercoraria 37588 ● A. multicolor 120861 ● A. versicolor 715463 99● 100A. amoeneus 117583 ● A. sydowii 91456 ● A. varians 49211 ● A. aurantiobrunneus 189701 ● A. fumigatus Af293 7175 100● A. clavatus 2868 100● A. aculeatinus 213058 96100● A. niger ATCC 1015 1131105 ● A. terreus 9678 ● A. ochraceoroseus 500141 ● A. spectabilis 156747 ● A. crustosus 136360 ● A. karnatakaensis 222397 ● A. undulatus 180175 100● A. olivicola 121420 ● A. stella−maris 64707 100● 100A. venezuelensis 353540 ● A. filifera 81254 ● A. tetrazonus 14468 ● A. foveolata 129823 ● A. acristatulus 188361 ● A. recurvatus 181532 ● A. nidulans 164 ● A. falconensis 3927 ● A. unguis 156801 ● A. navahoensis 99045 ● A. multicolor 151268 100 ● A. stercoraria 153012 100● A. desertorum 111193 ● A. fructiculosus 142678 ● A. indicus 158282 ● A. similis 193545 ● A. neoechinulatus 151347 ● A. varians 53905 ● A. amoeneus 9006 99● A. versicolor 564202 ● A. sydowii 87638 ● A. aurantiobrunneus 126055 ● A. crustosus 191318 99● A. spectabilis 159980 ● A. karnatakaensis 158765 ● A. ochraceoroseus 194986 ● A. venezuelensis 353116 ● A. filifera 154576 ● A. stella−maris 195990 99● A. undulatus 281989 ● A. olivicola 59563 ● A. undulatus 248616 ● A. aculeatinus 439596 99● 100 A. niger ATCC 1015 1091425 ● A. terreus 8131 ● A. fumigatus Af293 4175 ● A. clavatus 5817 ● A. desertorum 98740 98● A. stercoraria 82442 ● A. recurvatus 89808 ● A. fructiculosus 174218 98 ● A. falconensis 152220 ● A. navahoensis 186221 ● A. indicus 165312 ● A. similis 125426 ● A. nidulans 363 99● A. tetrazonus 164355 ● A. foveolata 76772 ● A. neoechinulatus 76826 ● A. acristatulus 184165 99 ● A. varians 358382 97● A. multicolor 128962 ● A. stella−maris 67799 ● A. filifera 165017 ● A. olivicola 192740 ● A. venezuelensis 369690 ● A. unguis 20583 ● A. versicolor 582335 100● 98100A. amoeneus 13119 ● A. sydowii 816513 ● A. aurantiobrunneus 79477 ● A. undulatus 201424 ● A. spectabilis 201775 99● 99 A. karnatakaensis 169463 ● A. crustosus 80807 ● A. aculeatinus 387438 100● 10097 A. niger ATCC 1015 1128039 ● A. terreus 2889 ● A. ochraceoroseus 435074 ● A. clavatus 5466 100● A. fumigatus Af293 3820 ● A. terreus 6478 ● A. clavatus 3658 ● A. fumigatus Af293 7767 ● A. ochraceoroseus 502790 ● A. unguis 135185 ● A. aculeatinus 87714 ● A. niger ATCC 1015 1147694 ● A. crustosus 389544 99● 99A. spectabilis 185687 ● A. karnatakaensis 218279 ● A. stercoraria 160154 ● A. desertorum 136688 ● A. similis 182276 ● A. recurvatus 171122 ● A. falconensis 154397 ● A. navahoensis 234895 ● A. indicus 175241 ● A. fructiculosus 176545 ● A. neoechinulatus 149180 100● A. undulatus 289499 100● A. filifera 157425 ● A. olivicola 210444 99● A. stella−maris 180359 ● A. venezuelensis 377720 ● A. versicolor 42450 ● A. sydowii 149945 99● A. amoeneus 220796 ● A. varians 334110 ● A. multicolor 182647 ● A. aurantiobrunneus 173949 ● A. foveolata 160469 ● A. tetrazonus 169257 ● A. acristatulus 213715 ● A. nidulans 10130 rhoA 0.9 159 4 of 11 Theobald et al. 10.1073/pnas.XXXXXXXXXX

Fig. S3. Maximum likelihood phylogeny of rho4, racA, modA and rhoA homologues Number of SMGC families with clusters in x number of orgs

433

400

300

200 Number of families

100

29 1921 1316 8 6 7 3 3 3 2 1 3 2 3 3 1 4 1 1 4 4 1 0 2 0 0 2 1 2 2 1 2 0

0 10 20 30 Number of orgs in family

Fig. S4. SMGC family counts by size. Barplot showing counts of families with certain number of organisms.

160 Theobald et al. 10.1073/pnas.XXXXXXXXXX 5 of 11 clade ● A. aurantiobrunneus ●A. nidulans 6374 araR 97●A. foveolata 157874 ● A. unguis ●A. tetrazonus 167322 ●A. acristatulus 223087 ● A. varians ● A. neoechinulatus 172055 ● clade_I 99●A. indicus 179364 ●A. similis 174444 ● clade_II ●A. recurvatus 57663 100 ● clade_III ●A. navahoensis 204759 ●A. stercoraria 2987 ● clade_IV 100● 9997A. desertorum 145251 ● ● references 100A. falconensis 98328 ●A. fructiculosus 170341 ●A. multicolor 149043 ●A. unguis 154125 ● 99A. venezuelensis 310861 100●A. stella−maris 12397 100●A. filifera 168320 100●A. olivicola 177539 ●A. undulatus 76994 ● 100A. amoeneus 96345 100●A. versicolor 48640 ● 100 A. sydowii 55222 ●A. varians 75673 ●A. aurantiobrunneus 218915 97 ●A. spectabilis 66782 99 ●A. crustosus 134443 ●A. karnatakaensis 205924 ●A. ochraceoroseus 491889 ● 100 A. nidulans 866 galR 100●A. tetrazonus 183291 100●A. foveolata 15608 ●A. acristatulus 226422 ●A. similis 214636 100 ●A. indicus 200181 ●A. neoechinulatus 117456 ●A. stercoraria 285370 100 ●A. navahoensis 229573 ●A. recurvatus 199915 100 ● 100A. falconensis 198035 99 ●A. fructiculosus 194608 ●A. multicolor 189515 ● 100A. venezuelensis 401931 100 ●A. stella−maris 191072 100 ●A. filifera 205314 100 ●A. olivicola 232481 ●A. undulatus 58155 ● 100A. versicolor 54117 100 97 100 ●A. amoeneus 234973 ●A. sydowii 212810 100 ●A. varians 349436 ●A. aurantiobrunneus 208511 ● 100 100 A. karnatakaensis 228941 ●A. spectabilis 184288 ●A. ochraceoroseus 492558 ● 100 A. clavatus 8261 ●A. fumigatus Af293 138 ●A. aculeatinus 384511 ●A. niger ATCC 1015 1167351

0.0 0.5 1.0 1.5 2.0 2.5

Fig. S5. Maximum likelihood phylogeny of protein family containing A. nidulans araR and galR. Tip labels are showing species name, jgi protein id and gene name (if applicable). Proteins were aligned using clustalo (74), trimmed using trimal (75) and a ML phylogeny created using iqtree (76) using a WAG+F+I+G4 substitution model. The ML phylogeny suggests a galR homolog in all species of section Nidulantes and potentially A. ochraceroseus. 161 6 of 11 Theobald et al. 10.1073/pnas.XXXXXXXXXX clade ● A. aurantiobrunneus ●A. unguis 146683 ● A. unguis ●A. aculeatinus 123701 ● A. varians ●A. terreus 678 ● clade_I ●A. aculeatinus 409968 ● clade_II 98 ●A. niger ATCC 1015 170198 ● clade_III ●A. clavatus 7074 100 ● clade_IV ●A. fumigatus Af293 1331 ● references ●A. ochraceoroseus 507754 ●A. varians 318661 ●A. versicolor 23964

● 100A. amoeneus 223803 ●A. sydowii 25992 ●A. aurantiobrunneus 75844 ●A. stella−maris 44278 99 ● 100A. venezuelensis 320887 ●A. filifera 49142 100 ●A. foveolata 46681 100 99 ●A. olivicola 198844 ●A. undulatus 286069 ●A. unguis 152029 ●A. karnatakaensis 208221

● 97A. spectabilis 164449 ●A. crustosus 402087 ●A. indicus 99307 ●A. nidulans 5921 ●A. acristatulus 159851 ●A. tetrazonus 43578 ●A. foveolata 146647 96 ●A. multicolor 173257 ●A. similis 96916 96●A. neoechinulatus 8302 ●A. desertorum 66140 100100 ●A. stercoraria 267796 ●A. recurvatus 82070 ●A. navahoensis 178326 ●A. fructiculosus 133712 ●A. falconensis 121374

0.0 0.5 1.0 1.5 2.0 2.5

Fig. S6. Maximum likelihood phylogeny of laeA protein family. Tip labels are showing species name and jgi protein id. Proteins were aligned using clustalo (74), trimmed using trimal (75) and a ML phylogeny created using iqtree (76) using a LG+G4 substitution model (ultra fast bootstrapping (77), model finder plus (78)). Species show one copy except A. foveolata, A. aculeatinus and A. unguis, which contain another copy. 162 Theobald et al. 10.1073/pnas.XXXXXXXXXX 7 of 11 A. sydowii 80897 A. amoeneus 221408 P. chrysogenum 22458 A. nidulans 9564 penicillin−NRPS A. tetrazonus 200908 A. indicus 198163 A. fructiculosus 211881 A. falconensis 181688 A. foveolata 164622

0.1

Fig. S7. ML phylogeny of penicillin NRPS homologs. Proteins were aligned using clustalo (74) and ML phylogeny created using iqtree (76) with a WAG+F+G4 substitution model (ultra fast bootstrapping (77), model finder plus (78))

163 8 of 11 Theobald et al. 10.1073/pnas.XXXXXXXXXX clade ● A. aurantiobrunneus ● A. unguis ● A. varians ● A. terreus 3427 ● clade_I ● A. niger ATCC 1015 1180842 100● ● 100A. aculeatinus 396729 clade_II ● A. fumigatus Af293 6208 99● A. clavatus 456 ● clade_III 98● A. ochraceoroseus 501719 ● ● A. karnatakaensis 99742 clade_IV ● A. amoeneus 201492 100● ● 99A. versicolor 143193 references 99● A. sydowii 50634 ● A. aurantiobrunneus 181502 ● A. spectabilis 176530 ● A. crustosus 367593 ● A. stella−maris 174927 99● A. filifera 166866 10099● A. venezuelensis 373864 ● A. olivicola 205876 ● A. undulatus 260666 96● A. varians 161990 ● A. unguis 141527 ● A. multicolor 156303 ● A. recurvatus 176526 ● A. stercoraria 240960 100● A. desertorum 133362 ● A. fructiculosus 206467 99● A. falconensis 151843 ● A. tetrazonus 178090 ● A. nidulans 1633 aspC ● A. acristatulus 189267 ● A. neoechinulatus 164326 100● A. similis 181597 ● A. indicus 169534 ● A. foveolata 88351 ● A. navahoensis 203597 ● A. olivicola 182929 ● A. filifera 162898 ● A. stella−maris 183931 99● A. venezuelensis 13497 ● A. undulatus 47073 ● A. karnatakaensis 198846 ● A. aurantiobrunneus 220647 ● A. unguis 87081 ● A. multicolor 167436 ● A. spectabilis 170228 ● A. varians 190334 ● A. versicolor 37124 100● A. amoeneus 209217 ● A. crustosus 379277 ● A. neoechinulatus 166926 ● A. foveolata 151470 ● A. nidulans 5274 aspD ● A. falconensis 171425 99● A. fructiculosus 117202 ● A. sydowii 52968 ● A. acristatulus 180309 ● A. recurvatus 164342 ● A. tetrazonus 19741 ● A. similis 184604 ● A. indicus 173605 ● A. stercoraria 173174 99● A. desertorum 137408 ● A. navahoensis 219575 ● A. ochraceoroseus 452725 ● A. aculeatinus 460251 ● A. clavatus 7653 100● A. fumigatus Af293 752 ● A. niger ATCC 1015 1098716 ● A. terreus 2643 ● 100A. amoeneus 91272 100● A. versicolor 141680 ● A. sydowii 34958 ● A. undulatus 267149 ● A. varians 332135 ● A. aurantiobrunneus 218715 96 ● A. venezuelensis 367792 97● A. stella−maris 20389 96● A. filifera 203598 ● A. olivicola 172032 ● 100A. crustosus 387357 100● A. spectabilis 146808 ● A. karnatakaensis 199684 ● A. unguis 166476 ● A. multicolor 185259 ● A. nidulans 4747 aspB ● A. foveolata 136768 ● A. similis 180226 ● A. recurvatus 192447 ● A. navahoensis 238983 ● A. indicus 25862 ● A. neoechinulatus 183107 ● A. tetrazonus 29842 100● A. acristatulus 105037 98● A. fructiculosus 211778 100● A. falconensis 148803 99● A. stercoraria 270251 ● A. desertorum 90807 99● A. ochraceoroseus 504584 100 ● A. aculeatinus 254273 100● A. niger ATCC 1015 1173846 ● A. clavatus 3448 100● A. fumigatus Af293 8940 ● A. terreus 9986 ● A. fructiculosus 174309 100● A. falconensis 142602 99● A. recurvatus 90731 ● A. navahoensis 94341 ● A. stercoraria 83671 100● A. desertorum 74088 ● A. tetrazonus 36917 96● A. nidulans 335 aspE 100● A. foveolata 111843 98100● A. acristatulus 20357 100● A. similis 170347 ● A. indicus 189788 ● A. multicolor 127838 ● 99A. stella−maris 87587 100● A. olivicola 153657 ● A. undulatus 199421 ● A. sydowii 87807 97● A. unguis 141252 ● A. spectabilis 192595 100 ● A. clavatus 5498 99● A. fumigatus Af293 3852 ● A. terreus 2858 ● A. ochraceoroseus 405099 ● A. aurantiobrunneus 175484 ● A. sydowii 59428 ● A. fumigatus Af293 6646 99● A. clavatus 2381 97● A. terreus 9338 ● A. aculeatinus 37896 ● A. niger ATCC 1015 1151441 ● A. ochraceoroseus 472320 ● A. amoeneus 195965 ● A. versicolor 53761 ● A. undulatus 271226 ● A. venezuelensis 234698 ● A. olivicola 194166 ● A. filifera 163854 ● A. varians 315270 ● A. stella−maris 185780 ● A. unguis 133145 ● 100A. crustosus 170832 98● A. spectabilis 162208 ● A. karnatakaensis 127244 ● A. multicolor 187733 ● A. neoechinulatus 152282 ● A. falconensis 187196 ● A. fructiculosus 187766 ● A. navahoensis 46330 ● A. recurvatus 153849 ● A. stercoraria 35276 ● A. desertorum 125285 ● A. indicus 185771 ● A. similis 186811 ● A. acristatulus 127487 ● A. foveolata 130647 ● A. nidulans 495 aspA ● A. tetrazonus 103091 0.9

Fig. S8. Maximum likelihood phylogeny of septin protein family.

164 Theobald et al. 10.1073/pnas.XXXXXXXXXX 9 of 11 ●A. navahoensis 36627 ●A. falconensis 149548 ●A. indicus 176797 ●A. desertorum 76321 ●A. similis 95390 ●A. stercoraria 242396 ●A. fructiculosus 178764 ●A. recurvatus 160072 ●A. nidulans 1908 nimX ●A. foveolata 135292 ●A. tetrazonus 138293 ●A. neoechinulatus 88240 ●A. acristatulus 64828 ●A. multicolor 8973 ●A. aurantiobrunneus 188833 ●A. unguis 9300 ●A. undulatus 250390 96●A. stella−maris 173388 ●A. filifera 110955 ●A. venezuelensis 392124 ●A. olivicola 227023 ●A. varians 285260 ●A. versicolor 44987 100●A. amoeneus 195073 ●A. sydowii 154295 ●A. karnatakaensis 227527 ●A. spectabilis 18401 ●A. crustosus 374113 ●A. ochraceoroseus 109745 ●A. aculeatinus 340042 ●A. niger ATCC 1015 1127329 ●A. clavatus 8615 ●A. fumigatus Af293 7874 ●A. terreus 9420 100 ●A. versicolor 764564 96●A. amoeneus 235913 ●A. sydowii 149497 ●A. spectabilis 172680 ● 99A. desertorum 89003 99●A. stercoraria 31658 ●A. navahoensis 10384 ●A. recurvatus 172533 ●A. indicus 10732 ●A. nidulans 1739 phoA ●A. tetrazonus 168729 ●A. foveolata 139765 100●A. fructiculosus 180138 100●A. falconensis 151605 ●A. multicolor 186120 ●A. neoechinulatus 119157 ●A. similis 191306 ●A. acristatulus 57251 ●A. varians 315721 ●A. aurantiobrunneus 188236 clade ● 100A. amoeneus 59997 100●A. versicolor 385057 ● A. aurantiobrunneus ●A. sydowii 161723 ● ● A. unguis 98A. venezuelensis 366708 99●A. stella−maris 154813 ●A. filifera 186953 ● A. varians ● 98A. olivicola 225870 ● ●A. undulatus 117649 clade_I 97 ● 100100A. crustosus 390046 ● clade_II 100●A. karnatakaensis 205043 100 ●A. spectabilis 166366 ● clade_III ●A. unguis 164821 ●A. ochraceoroseus 479529 ● clade_IV ●A. fumigatus Af293 6312 9999●A. clavatus 354 ● references 98●A. terreus 2139 99 ●A. niger ATCC 1015 1155842 ●A. aculeatinus 388804 ●A. niger ATCC 1015 36108 ●A. unguis 153320 ●A. ochraceoroseus 470440 ●A. niger ATCC 1015 1187550 ● 99 A. crustosus 397678 100●A. spectabilis 161282 ●A. karnatakaensis 44039 ●A. olivicola 193595 98 ●A. stella−maris 162369 96●A. venezuelensis 370092 98●A. filifera 165706 96●A. undulatus 253937 ●A. aurantiobrunneus 51426 ● 100A. versicolor 68978 100●A. amoeneus 200090 ●A. sydowii 141635 ●A. varians 187431 ●A. multicolor 174155 ●A. nidulans 7560 phoB ●A. foveolata 117680 98●A. acristatulus 194724 ●A. tetrazonus 199340 ●A. neoechinulatus 159574 ●A. stercoraria 25131 100●A. desertorum 129324 100●A. similis 187678 ●A. indicus 5130 ●A. recurvatus 138795 ●A. navahoensis 223813 ●A. falconensis 168352 ●A. fructiculosus 102525 0.9

Fig. S9. Maximum likelihood phylogeny of nimX, phoA, phoB protein family.

165 10 of 11 Theobald et al. 10.1073/pnas.XXXXXXXXXX clade ● A. aurantiobrunneus ● A. unguis ● A. varians ● clade_I ● A. desertorum 53 ● A. falconensis 104137 ● ● A. fructiculosus 22678 clade_II ● A. recurvatus 194631 ● A. acristatulus 61420 ● clade_III ● A. multicolor 67678 ● A. tetrazonus 67366 ● clade_IV ● A. neoechinulatus 90974 ● A. stercoraria 42737 ● references ● A. navahoensis 231773 ● A. foveolata 61343 ● A. similis 38955 ● A. indicus 127712 96● A. aurantiobrunneus 85238 97● A. unguis 127158 ● A. nidulans 3302 modA ● A. stella−maris 74583 100● A. filifera 191231 ● A. venezuelensis 202800 96● A. varians 70675 ● A. undulatus 126529 ● A. olivicola 62220 ● A. versicolor 85914 ● A. amoeneus 122256 ● A. sydowii 58457 ● A. ochraceoroseus 510065 96● A. spectabilis 181987 ● A. karnatakaensis 145928 ● A. crustosus 15570 ● A. clavatus 789 ● A. fumigatus Af293 2196 ● A. terreus 10326 100● A. aculeatinus 194244 99 ● A. niger ATCC 1015 52477 ● A. varians 357983 ● A. stella−maris 66981 ● A. filifera 195919 ● A. venezuelensis 376260 ● A. unguis 19833 ● A. undulatus 202713 ● A. olivicola 199597 ● A. aurantiobrunneus 78549 ● A. varians 120122 ● A. sydowii 822203 100● A. amoeneus 219006 ● A. versicolor 586268 ● A. multicolor 149941 ● A. crustosus 389190 99● 97A. spectabilis 178199 100 ● A. karnatakaensis 157524 ● A. stercoraria 245376 ● A. desertorum 141530 ● A. nidulans 410 racA ● A. similis 103397 ● A. foveolata 169162 ● A. navahoensis 214679 ● A. recurvatus 88865 ● A. tetrazonus 38491 ● A. acristatulus 18333 98● A. indicus 165142 ● A. falconensis 33381 100● A. fructiculosus 168315 ● A. neoechinulatus 152031 ● A. ochraceoroseus 515967 98 ● A. niger ATCC 1015 1147254 ● A. aculeatinus 10173 ● A. terreus 2936 ● A. clavatus 5427 ● A. fumigatus Af293 3778 ● A. nidulans 9493 rho4 ● A. tetrazonus 26867 100● A. neoechinulatus 163561 ● A. foveolata 106345 100● A. similis 21888 ● A. acristatulus 144955 ● A. unguis 88150 ● A. recurvatus 36416 ● A. indicus 18058 ● A. navahoensis 144734 ● A. fructiculosus 65183 ● A. falconensis 27954 ● A. desertorum 40525 99● A. stercoraria 37588 ● A. multicolor 120861 ● A. versicolor 715463 99● 100A. amoeneus 117583 ● A. sydowii 91456 ● A. varians 49211 ● A. aurantiobrunneus 189701 ● A. fumigatus Af293 7175 100● A. clavatus 2868 100● A. aculeatinus 213058 96100● A. niger ATCC 1015 1131105 ● A. terreus 9678 ● A. ochraceoroseus 500141 ● A. spectabilis 156747 ● A. crustosus 136360 ● A. karnatakaensis 222397 ● A. undulatus 180175 100● A. olivicola 121420 ● A. stella−maris 64707 100● 100A. venezuelensis 353540 ● A. filifera 81254 ● A. tetrazonus 14468 ● A. foveolata 129823 ● A. acristatulus 188361 ● A. recurvatus 181532 ● A. nidulans 164 ● A. falconensis 3927 ● A. unguis 156801 ● A. navahoensis 99045 ● A. multicolor 151268 100 ● A. stercoraria 153012 100● A. desertorum 111193 ● A. fructiculosus 142678 ● A. indicus 158282 ● A. similis 193545 ● A. neoechinulatus 151347 ● A. varians 53905 ● A. amoeneus 9006 99● A. versicolor 564202 ● A. sydowii 87638 ● A. aurantiobrunneus 126055 ● A. crustosus 191318 99● A. spectabilis 159980 ● A. karnatakaensis 158765 ● A. ochraceoroseus 194986 ● A. venezuelensis 353116 ● A. filifera 154576 ● A. stella−maris 195990 99● A. undulatus 281989 ● A. olivicola 59563 ● A. undulatus 248616 ● A. aculeatinus 439596 99● 100 A. niger ATCC 1015 1091425 ● A. terreus 8131 ● A. fumigatus Af293 4175 ● A. clavatus 5817 ● A. desertorum 98740 98● A. stercoraria 82442 ● A. recurvatus 89808 ● A. fructiculosus 174218 98 ● A. falconensis 152220 ● A. navahoensis 186221 ● A. indicus 165312 ● A. similis 125426 ● A. nidulans 363 99● A. tetrazonus 164355 ● A. foveolata 76772 ● A. neoechinulatus 76826 ● A. acristatulus 184165 99 ● A. varians 358382 97● A. multicolor 128962 ● A. stella−maris 67799 ● A. filifera 165017 ● A. olivicola 192740 ● A. venezuelensis 369690 ● A. unguis 20583 ● A. versicolor 582335 100● 98100A. amoeneus 13119 ● A. sydowii 816513 ● A. aurantiobrunneus 79477 ● A. undulatus 201424 ● A. spectabilis 201775 99● 99 A. karnatakaensis 169463 ● A. crustosus 80807 ● A. aculeatinus 387438 100● 10097 A. niger ATCC 1015 1128039 ● A. terreus 2889 ● A. ochraceoroseus 435074 ● A. clavatus 5466 100● A. fumigatus Af293 3820 ● A. terreus 6478 ● A. clavatus 3658 ● A. fumigatus Af293 7767 ● A. ochraceoroseus 502790 ● A. unguis 135185 ● A. aculeatinus 87714 ● A. niger ATCC 1015 1147694 ● A. crustosus 389544 99● 99A. spectabilis 185687 ● A. karnatakaensis 218279 ● A. stercoraria 160154 ● A. desertorum 136688 ● A. similis 182276 ● A. recurvatus 171122 ● A. falconensis 154397 ● A. navahoensis 234895 ● A. indicus 175241 ● A. fructiculosus 176545 ● A. neoechinulatus 149180 100● A. undulatus 289499 100● A. filifera 157425 ● A. olivicola 210444 99● A. stella−maris 180359 ● A. venezuelensis 377720 ● A. versicolor 42450 ● A. sydowii 149945 99● A. amoeneus 220796 ● A. varians 334110 ● A. multicolor 182647 ● A. aurantiobrunneus 173949 ● A. foveolata 160469 ● A. tetrazonus 169257 ● A. acristatulus 213715 ● A. nidulans 10130 rhoA 0.9

Fig. S10. Maximum likelihood phylogeny of rho4, racA, modA protein family.

166 Theobald et al. 10.1073/pnas.XXXXXXXXXX 11 of 11