BioRxiv 2018, X Supplementary Information for:

A computational framework for systematic exploration of biosynthetic diversity from large-scale genomic data

Jorge C. Navarro-Muñoz1,2*, Nelly Selem-Mojica3*, Michael W. Mullowney4*, Satria Kautsar1, James H. Tryon4, Elizabeth Parkinson5$, Emmanuel L.C. De Los Santos6, Marley Yeong1, Pablo Cruz-Morales3, Sahar Abubucker7, Arne Roeters1, Wouter Lokhorst1, Antonio Fernandez-Guerra8, Luciana Teresa Dias Cappelini4, Regan Thomson4, William W. Metcalf5, Neil L. Kelleher4, Francisco Barona-Gomez3#, Marnix H. Medema1#

1Bioinformatics Group, Wageningen University, The Netherlands. 2Westerdijk Fungal Biodiversity Institute, The Netherlands. 3Evolution of Metabolic Diversity Laboratory, Unidad de Genómica Avanzada (Langebio), Cinvestav-IPN, Irapuato, Mexico. 4Department of Chemistry, Northwestern University, Evanston, Illinois, United States. 5Carl R. Woese Institute for Genomic Biology and Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States. 6Warwick Integrative Synthetic Biology Centre, University of Warwick, Coventry, United Kingdom. 7Novartis Institutes for BioMedical Research, Cambridge, United States. 8Microbial Genomics and Bioinformatics, Max Planck Institute for Marine Microbiology, Bremen, Germany.

*joint first authors #joint corresponding authors

E-mail: [email protected], [email protected]

$Current address: Department of Chemistry, Purdue University, West Lafayette, Indiana, United States.

BiG-SCAPE parameters setup
  Text S1. Index Information
  Table S1. Manually Curated Compound Groups
  Text S2. BiG-SCAPE classes weight optimization
  Figure S1. Scatterplots with weight optimization results.
  Text S3. MIBiG network cutoff analysis
  Figure S2. Targeted attack of the MIBiG network
  Text S4. Clustering analysis on the MIBiG network
  Figure S3. Clustering analysis on the MIBiG network using glocal mode
  Figure S4. Clustering analysis on the MIBiG network using global mode
  Table S2. Clustering methods used to identify potential BGC families in the training networks.
  Figure S5. Alluvial plot depicting BiG-SCAPE’s assignment of MIBiG BGCs to Gene Cluster Families against the manually curated groups in Table S1.
  Text S5. Actinobacteria Genomes Data Set

Other BiG-SCAPE information
  Table S3. Compute times for BiG-SCAPE calculations for datasets of multiple sizes.
  Table S4. Default Anchor Domain List used in the DSS index.
  Table S5. Logic for internal class separation.
  Figure S6. Flowchart of BiG-SCAPE components.
  Figure S9. Verified clusters encoding rimosamide biosynthesis. Signature gene(s) highlighted.
  Figure S10. Rimosamide-related GCFs in Fig 3a.
  Figure S11. Phylogenomic reconstruction of 103 complete Streptomyces genomes with outgroups Catenulispora acidiphila CP001700.1 and Salinispora arenicola CP000850.1.
  Figure S12. BiG-SCAPE / CORASON GCFs of the closed Streptomyces genomes.
  Figure S13. tauD Actinobacteria EvoMining expansions tree.
  Table S6. Occurrences of tauD homologues in secondary metabolism BGCs as reported in MIBiG.

Microbiology and metabolomics methods
  Text S6. Cultivation of actinomycetes for MS-based metabolomics.
  Text S7. Acquisition and analysis of LC-MS metabolomics data.

Metabologenomic correlations
  Table S7. Size of each correlation round resulting from various BiG-SCAPE outputs.
  Fig. S14. Binary correlation score chart between ions and GCFs used for metabologenomics.
  Table S8. Correlation scores for known ion-GCF pairs in the dataset across the four correlation rounds. The correlation score is highlighted in yellow if it was higher than the estimated 1% FDR threshold.
  Text S8. Metabolic labeling of detoxins N1, N2, P1, P2, and P3 with stable isotope-labeled amino acids.

Metabolomics data and structural analysis
  Structural characterization of detoxin S1 from the P450/enoyl clade.
  Fig. S15. Tandem MS spectrum of detoxin S1 (1, m/z 518.324) from Streptomyces species NRRL S-325.
  Fig. S16. Comparison of the tandem MS spectrum of detoxin S1 (1, m/z 518.324) from Streptomyces species NRRL S-325 with the tandem MS spectrum of detoxin B3.
  Structural characterization of detoxins N1 and N2 from the supercluster clade.
  Fig. S17. Tandem MS spectrum of detoxin N1 (2, m/z 464.239) from Streptomyces spectabilis NRRL 2792.
  Fig. S18. Stable isotope labeled-amino acid incorporation of d7-proline, phenyl-d4-tyrosine, and 13C6-isoleucine in detoxin N1 (2, m/z 464.239) from Streptomyces spectabilis NRRL 2792.
  Fig. S19. Tandem MS spectrum of detoxin N2 (3, m/z 522.244) from Streptomyces spectabilis NRRL 2792.
  Fig. S20. Stable isotope labeled-amino acid incorporation of d7-proline, phenyl-d4-tyrosine, and 13C6-isoleucine in detoxin N2 (3, m/z 522.244) from Streptomyces spectabilis NRRL 2792.
  Structural characterization of detoxins P1, P2, and P3 from the Amycolatopsis/P450 clade.
  Fig. S21. Tandem MS spectrum of detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S22. Stable isotope labeled-amino acid incorporation of d7-proline and d8-valine with loss of one deuteron in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S23. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of d8-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S24. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 2d1-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S25. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 3d1-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S26. Tandem MS spectrum of detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S27. Stable isotope labeled-amino acid incorporation of d7-proline, d8,15N-phenylalanine, and d8-valine with loss of one deuteron in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S28. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of d8-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S29. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 2d1-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S30. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 3d1-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S31. Tandem MS spectrum of detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S32. Stable isotope labeled-amino acid incorporation of d7-proline, d8-phenylalanine, and d8-valine with loss of one deuteron in detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.
  Fig. S33. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of d8-valine in detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.

REFERENCES

BiG-SCAPE parameters setup

Text S1. Index Information

The distance between two Biosynthetic Gene Clusters (BGCs) A and B is calculated by combining three similarity scores:

Jaccard Index: the number of distinct domain types shared by both BGCs divided by the total number of distinct domain types in the pair:

$$JI_{AB} = \frac{|D^{u}_{A} \cap D^{u}_{B}|}{|D^{u}_{A} \cup D^{u}_{B}|}$$

where $D_{A(B)}$ is the list of all domains in BGC A (B) and $D^{u}_{A(B)}$ is the set of distinct domain types in BGC A (B).
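As a minimal sketch of the Jaccard Index described above (not the BiG-SCAPE implementation itself), the following Python function works on plain lists of Pfam accessions. The function name, the example domain lists, and the convention of returning 0 for two empty BGCs are assumptions made for illustration.

```python
def jaccard_index(domains_a, domains_b):
    """Jaccard index between the distinct domain types of two BGCs.

    Each argument is the list of predicted domains of one BGC; repeated
    copies of a domain are collapsed because only distinct types count.
    """
    types_a, types_b = set(domains_a), set(domains_b)
    union = types_a | types_b
    if not union:
        # Assumption: two BGCs without any predicted domain share nothing.
        return 0.0
    return len(types_a & types_b) / len(union)


# Hypothetical domain content of two BGCs (Pfam accessions).
bgc_a = ["PF00109", "PF02801", "PF00550", "PF00109"]
bgc_b = ["PF00109", "PF02801", "PF00668"]

# 2 shared types out of 4 distinct types in the pair.
print(jaccard_index(bgc_a, bgc_b))  # 0.5
```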

Adjacency Index: the number of distinct adjacent domain pair types shared by both BGCs divided by the total number of distinct adjacent domain pair types in the pair:

$$AI_{AB} = \frac{|P^{u}_{A} \cap P^{u}_{B}|}{|P^{u}_{A} \cup P^{u}_{B}|}$$

where $P^{u}_{A(B)}$ is the set of all distinct adjacent domain pair types in BGC A (B), e.g.:

$$P^{u}_{A} = \{ (D_A[i], D_A[i+1]) \mid i \in \{0, 1, 2, \dots, L_A - 2\} \}$$

where $D_A$ is the ordered list of domains predicted in BGC A and $L_A$ is the length of that list, so $i+1$ never runs past the last domain. For genes in multi-record files (e.g. with information from different loci, which only occurs for MIBiG reference gene clusters), the pairs are formed in the order in which the domains appear in the original file.
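The Adjacency Index can be sketched in the same way. The example below is illustrative only: it assumes that a pair is ordered, i.e. (X, Y) differs from (Y, X), which the text above does not specify, and the function names and domain lists are hypothetical.

```python
def adjacent_pairs(domains):
    """Set of distinct pairs of consecutive domains, in file order."""
    # Assumption: pairs are kept as ordered tuples.
    return {(domains[i], domains[i + 1]) for i in range(len(domains) - 1)}


def adjacency_index(domains_a, domains_b):
    """Shared adjacent pair types divided by all adjacent pair types."""
    pairs_a, pairs_b = adjacent_pairs(domains_a), adjacent_pairs(domains_b)
    union = pairs_a | pairs_b
    if not union:
        # Assumption: BGCs with fewer than two domains have no adjacency.
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical ordered domain lists (Pfam accessions).
bgc_a = ["PF00109", "PF02801", "PF00550"]
bgc_b = ["PF00109", "PF02801", "PF00668"]

# One shared pair (PF00109, PF02801) out of three distinct pairs.
print(adjacency_index(bgc_a, bgc_b))
```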

Domain Sequence Similarity: a score that considers the sequence similarities for every domain type.

$$DSS = 1 - DSS_{D}$$

where $DSS_{D}$ is the Domain Sequence Dissimilarity. This is divided further into two subcomponents:

$$DSS_{D} = w_{a} DSS_{a} + w_{na} DSS_{na}$$

The first accounts for so-called anchor domains, a list of domains which can be given a special weight (for the list of default anchor domains $a$, which contains well-known domains for e.g. NRPS or PKS BGCs, see Table S4), while the second accounts for the rest of the domains, $na$. Each (non-)anchor subcomponent is calculated in the same manner:

$$DSS_{a} = \frac{1}{S_{a}} \sum_{d \in \{a\}} sd(d_A, d_B), \qquad DSS_{na} = \frac{1}{S_{na}} \sum_{d \in \{na\}} sd(d_A, d_B)$$

where $S_{a} = \sum_{d \in \{a\}} \max(n^{d}_{A}, n^{d}_{B})$ (and analogously for $S_{na}$); $d$ ranges over all distinct domain types in the pair that belong to the list of anchor domains, and $n^{d}_{A(B)}$ is the number of copies of domain $d$ in BGC A (B). $sd$ is a function that takes all copies of domain $d$ in A and B, finds the best-matching copies of the same domain type (using the Hungarian algorithm), and returns the sum over those matches of the complement of the sequence similarity (1 − similarity, with the similarity calculated from domain sequences aligned against their HMM profile using hmmalign). For extra copies that have no match, and for unshared domains, the function adds 1 (complete dissimilarity). Finally, if domains of each kind exist in the pair, both subcomponents are first weighted proportionally to the total number of domains of each kind (including copies):

$$w_{a} = \frac{S_{a}}{S_{a} + S_{na}}, \qquad w_{na} = \frac{S_{na}}{S_{a} + S_{na}}$$

and then re-weighted to increase the perceived proportion of anchor domains through the “anchorboost” parameter:

$$w_{a} \leftarrow \frac{w_{a} \times \mathit{anchorboost}}{w_{a} \times \mathit{anchorboost} + w_{na}}, \qquad w_{na} \leftarrow \frac{w_{na}}{w_{a} \times \mathit{anchorboost} + w_{na}}$$

where the right-hand sides use the proportional weights from the previous step.
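The DSS calculation can be illustrated with a small, self-contained Python sketch. It is not the BiG-SCAPE code: the function names and input layout are hypothetical, the pairwise dissimilarities are made-up numbers rather than hmmalign-derived values, and the optimal matching is found by brute force over permutations as a stand-in for the Hungarian algorithm (same result, only practical for a handful of copies). When only one kind of domain is present, the sketch assumes that kind receives the full weight.

```python
from itertools import permutations


def sd(n_a, n_b, dist):
    """Summed dissimilarity for one domain type.

    dist[i][j] is 1 - similarity between copy i in BGC A and copy j in BGC B.
    Unmatched extra copies and unshared domains each contribute 1.
    """
    if n_a == 0 or n_b == 0:
        return float(max(n_a, n_b))
    if n_a > n_b:  # transpose so that rows are the smaller side
        dist = [list(col) for col in zip(*dist)]
        n_a, n_b = n_b, n_a
    # Brute-force stand-in for the Hungarian algorithm.
    best = min(sum(dist[i][j] for i, j in enumerate(perm))
               for perm in permutations(range(n_b), n_a))
    return best + (n_b - n_a)


def dss(domain_types, anchors, anchorboost=2.0):
    """domain_types maps a domain name to (n_a, n_b, dist); returns DSS."""
    sums = {True: 0.0, False: 0.0}  # summed dissimilarity, keyed by is-anchor
    sizes = {True: 0, False: 0}     # S_a and S_na
    for name, (n_a, n_b, dist) in domain_types.items():
        is_anchor = name in anchors
        sums[is_anchor] += sd(n_a, n_b, dist)
        sizes[is_anchor] += max(n_a, n_b)
    s_a, s_na = sizes[True], sizes[False]
    if s_a and s_na:
        w_a, w_na = s_a / (s_a + s_na), s_na / (s_a + s_na)
        boosted = w_a * anchorboost
        w_a, w_na = boosted / (boosted + w_na), w_na / (boosted + w_na)
        dissimilarity = w_a * sums[True] / s_a + w_na * sums[False] / s_na
    elif s_a or s_na:
        # Assumption: a single kind of domain gets the full weight.
        dissimilarity = (sums[True] + sums[False]) / (s_a + s_na)
    else:
        dissimilarity = 1.0  # assumption: no domains, nothing in common
    return 1.0 - dissimilarity


# Hypothetical pair: one anchor domain and two non-anchor domain types.
example = {
    "PF00109": (1, 1, [[0.2]]),         # anchor, one copy on each side
    "PF00550": (2, 1, [[0.1], [0.5]]),  # one extra copy in A
    "PF08659": (0, 1, []),              # present only in B
}
print(round(dss(example, anchors={"PF00109"}, anchorboost=2.0), 3))  # 0.5
```

In this example $S_a = 1$ and $S_{na} = 3$, so the proportional weights 0.25 and 0.75 become 0.4 and 0.6 after applying an anchorboost of 2.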

Table S1. Manually Curated Compound Groups

Class | Subclass | Group | Compound | MIBiG accession
Polyketides | Type I Mechanism | Antifungal polyenes | amphotericin | BGC0000015
Polyketides | Type I Mechanism | Antifungal polyenes | amphotericin B | BGC0000016
Polyketides | Type I Mechanism | Antifungal polyenes | nystatin A1 | BGC0000115
Polyketides | Type I Mechanism | Antifungal polyenes | pimaricin | BGC0000125
Polyketides | Type I Mechanism | Antifungal polyenes | candicidin | BGC0000034
Polyketides | Type I Mechanism | Antifungal polyenes | natamycin | BGC0000108
Polyketides | Type I Mechanism | Antifungal polyenes | rimocidin | BGC0000138
Polyketides | Type I Mechanism | Macrolides (14-membered) | erythromycin | BGC0000054
Polyketides | Type I Mechanism | Macrolides (14-membered) | megalomicin A | BGC0000092
Polyketides | Type I Mechanism | Macrolides (14-membered) | megalomicin B | BGC0000092
Polyketides | Type I Mechanism | Macrolides (14-membered) | megalomicin C1 | BGC0000092
Polyketides | Type I Mechanism | Macrolides (14-membered) | megalomicin C2 | BGC0000092
Polyketides | Type I Mechanism | Macrolides (14-membered) | lankamycin | BGC0000085
Polyketides | Type I Mechanism | Macrolides (16-membered) | angolamycin | BGC0000018, BGC0000019
Polyketides | Type I Mechanism | Macrolides (16-membered) | chalcomycin | BGC0000035
Polyketides | Type I Mechanism | Macrolides (16-membered) | mycinamicin | BGC0000102
Polyketides | Type I Mechanism | Macrolides (16-membered) | midecamycin | BGC0000096
Polyketides | Type I Mechanism | Macrolides (16-membered) | niddamycin | BGC0000113
Polyketides | Type I Mechanism | Macrolides (18-membered) | concanamycin A | BGC0000040
Polyketides | Type I Mechanism | Macrolides (18-membered) | leucanicidin | BGC0001232
Polyketides | Type I Mechanism | Macrolides (18-membered) | bafilomycin B1 | BGC0000028
Polyketides | Type I Mechanism | Macrocyclic lactones (16-membered) | nemadectin | BGC0000109
Polyketides | Type I Mechanism | Macrocyclic lactones (16-membered) | meilingmycin | BGC0000093
Polyketides | Type I Mechanism | Macrocyclic lactones (16-membered) | avermectin | BGC0000025
Polyketides | Type I Mechanism | Macrolactams (20-membered) | BE-14106 | BGC0000029
Polyketides | Type I Mechanism | Macrolactams (20-membered) | ML-449 | BGC0000097
Polyketides | Type I Mechanism | Macrolactams (26-membered) | salinilactam | BGC0000142
Polyketides | Type I Mechanism | Macrolactams (26-membered) | micromonolactam | BGC0000095
Polyketides | Type I Mechanism | Polyethers | nigericin | BGC0000114
Polyketides | Type I Mechanism | Polyethers | nanchangmycin | BGC0000105
Polyketides | Type I Mechanism | Polyethers | monensin | BGC0000100
Polyketides | Type I Mechanism | Polyethers | laidlomycin | BGC0000084
Polyketides | Type I Mechanism | Polyethers | salinomycin | BGC0000143, BGC0000144
Polyketides | Type I Mechanism | Polyethers | lasalocid | BGC0000086, BGC0000087
Polyketides | Type I Mechanism | Statins | compactin | BGC0000039
Polyketides | Type I Mechanism | Statins | monacolin K | BGC0000098
Polyketides | Type I Mechanism | Statins | lovastatin | BGC0000088, BGC0000089
Polyketides | Type I Mechanism | Naphthoquinone ansamycins | divergolide A | BGC0001119
Polyketides | Type I Mechanism | Naphthoquinone ansamycins | divergolide B | BGC0001119
Polyketides | Type I Mechanism | Naphthoquinone ansamycins | divergolide C | BGC0001119
Polyketides | Type I Mechanism | Naphthoquinone ansamycins | divergolide D | BGC0001119
Polyketides | Type I Mechanism | Naphthoquinone ansamycins | hygrocin A | BGC0000075
Polyketides | Type I Mechanism | Naphthoquinone ansamycins | hygrocin B | BGC0000075
Polyketides | Type I Mechanism | Benzoquinone ansamycins | macbecin I | BGC0000090
Polyketides | Type I Mechanism | Benzoquinone ansamycins | macbecin II | BGC0000090
Polyketides | Type I Mechanism | Benzoquinone ansamycins | geldanamycin | BGC0000067, BGC0000068
Polyketides | Type I Mechanism | Benzoquinone ansamycins | herbimycin | BGC0000074
Polyketides | Type I Mechanism | Nitroaryl-substituted polyketides | aureothin | BGC0000024
Polyketides | Type I Mechanism | Nitroaryl-substituted polyketides | neoaureothin | BGC0000110
Polyketides | Type I Mechanism | Nitroaryl-substituted polyketides | orinocin | BGC0000110
Polyketides | Type I Mechanism | Nitroaryl-substituted polyketides | SNF4435C | BGC0000110
Polyketides | Type I Mechanism | Nitroaryl-substituted polyketides | SNF4435D | BGC0000110
Polyketides | Type I Mechanism | PKS_I_1 | tetronasin | BGC0000163
Polyketides | Type I Mechanism | PKS_I_1 | tetronomycin | BGC0000164
Polyketides | Type I Mechanism | PKS_I_2 | streptolydigin | BGC0001046
Polyketides | Type I Mechanism | PKS_I_2 | tirandamycin | BGC0001052
Polyketides | Type I Mechanism | PKS_I_3 | cremimycin | BGC0000042
Polyketides | Type I Mechanism | PKS_I_3 |  | BGC0001194
Polyketides | Type I Mechanism | PKS_I_4 | nataxazole | BGC0001213
Polyketides | Type I Mechanism | PKS_I_4 | A33853 | BGC0001292
Polyketides | Type I Mechanism | PKS_I_5 | cinnabaramide | BGC0000971
Polyketides | Type I Mechanism | PKS_I_5 | salinosporamide A | BGC0000145
Polyketides | Type I Mechanism | PKS_I_6 | jerangolid A | BGC0000080
Polyketides | Type I Mechanism | PKS_I_6 | jerangolid D | BGC0000080
Polyketides | Type I Mechanism | PKS_I_6 | ambruticin | BGC0000014
Polyketides | Type I Mechanism | Pks_Ene_I | C-1027 | BGC0000965
Polyketides | Type I Mechanism | Pks_Ene_I | neocarzinostatin | BGC0000112
Polyketides | Type I Mechanism | Pks_Ene_I | maduropeptin | BGC0001008
Polyketides | Type I Mechanism | PKS_Trans_1 | pederin | BGC0001108, BGC0001109, BGC0000255
Polyketides | Type I Mechanism | PKS_Trans_1 | pseudopederin | BGC0001108
Polyketides | Type I Mechanism | PKS_Trans_1 | pederone | BGC0001108
Polyketides | Type I Mechanism | PKS_Trans_1 | psymberin | BGC0001110
Polyketides | Type I Mechanism | PKS_Trans_1 | irciniastatin B | BGC0001110
Polyketides | Type I Mechanism | PKS_Trans_1 | onnamide A | BGC0001105
Polyketides | Type I Mechanism | PKS_Trans_2 | dorrigocin A | BGC0000177
Polyketides | Type I Mechanism | PKS_Trans_2 | dorrigocin B | BGC0000177
Polyketides | Type I Mechanism | PKS_Trans_2 | 13-epi-Dorrigocin A | BGC0000177
Polyketides | Type I Mechanism | PKS_Trans_2 | migrastatin | BGC0000177
Polyketides | Type I Mechanism | PKS_Trans_2 | iso-migrastatin | BGC0000177
Polyketides | Type I Mechanism | PKS_Trans_2 | 9-methylstreptimidone | BGC0000171
Polyketides | Type I Mechanism | PKS_Trans_2 | cycloheximide | BGC0000175
Polyketides | Type I Mechanism | PKS_Trans_2 | lactimidomycin | BGC0000083
Polyketides | Type I Mechanism | PKS_Trans_2 | 8,9-dihydro-LTM | BGC0000083
Polyketides | Type I Mechanism | PKS_Trans_2 | 8,9-dihydro-8S-hydroxy-LTM | BGC0000083
Polyketides | Type I Mechanism | PKS_Trans_2 | 8,9-dihydro-9R-hydroxy-LTM | BGC0000083
Polyketides | Type I Mechanism | PKS_Trans_3 | chlorotonil | BGC0001299
Polyketides | Type I Mechanism | PKS_Trans_3 | anthracimycin | BGC0001300, BGC0001301
Polyketides | Type I Mechanism | PKS_Trans_4 | spliceostatin | BGC0001113
Polyketides | Type I Mechanism | PKS_Trans_4 | thailanstatin A | BGC0001114
Polyketides | Type I Mechanism | PKS_Trans_4 | FR901464 | BGC0001113
Polyketides | Type I Mechanism | PKS_Iter_1 | tenellin | BGC0001049
Polyketides | Type I Mechanism | PKS_Iter_1 | aspyridone A | BGC0000959
Polyketides | Type I Mechanism | PKS_Iter_1 | desmethylbassianin | BGC0001136
Polyketides | Type I Mechanism | PKS_Iter_2 | aflatoxin | BGC0000004, BGC0000005, BGC0000006, BGC0000007, BGC0000008, BGC0000009, BGC0000010
Polyketides | Type I Mechanism | PKS_Iter_2 | sterigmatocystin | BGC0000011, BGC0000152
Polyketides | Type I Mechanism | PKS_Iter_3 | fusarin | BGC0000064, BGC0001268
Polyketides | Type I Mechanism | PKS_Iter_3 | NG-391 | BGC0001026
Polyketides | Type I Mechanism | PKS_Iter_4 | equisetin | BGC0001255
Polyketides | Type I Mechanism | PKS_Iter_4 | fusaridione A | BGC0000992
Polyketides | Type II Mechanism | PKS_II_1 | landomycin A | BGC0000239
Polyketides | Type II Mechanism | PKS_II_1 | grincamycin | BGC0000229
Polyketides | Type II Mechanism | PKS_II_2 | steffimycin | BGC0000273
Polyketides | Type II Mechanism | PKS_II_2 | arimetamycin A | BGC0000199
Polyketides | Type II Mechanism | PKS_II_2 | arimetamycin B | BGC0000199
Polyketides | Type II Mechanism | PKS_II_2 | arimetamycin C | BGC0000199
Polyketides | Type II Mechanism | PKS_II_3 | rubromycin | BGC0000266
Polyketides | Type II Mechanism | PKS_II_3 | griseorhodin A | BGC0000230
Polyketides | Type II Mechanism | PKS_II_4 | lysolipin | BGC0000242
Polyketides | Type II Mechanism | PKS_II_4 | xantholipin | BGC0000279
Polyketides | Type II Mechanism | PKS_II_5 | ravidomycin | BGC0000263
Polyketides | Type II Mechanism | PKS_II_5 | chrysomycin | BGC0000211
Polyketides | Type II Mechanism | PKS_II_5 | gilvocarcin | BGC0000226
Polyketides | Type II Mechanism | PKS_II_6 | dactylocycline | BGC0000216
Polyketides | Type II Mechanism | PKS_II_6 | chlortetracycline | BGC0000209
Polyketides | Type II Mechanism | PKS_II_6 | oxytetracycline | BGC0000254
Polyketides | Type II Mechanism | PKS_II_7 | asukamycin | BGC0000187
Polyketides | Type II Mechanism | PKS_II_7 | colabomycin E | BGC0000213
Polyketides | Type II Mechanism | PKS_II_8 | doxorubicin | BGC0000218
Polyketides | Type II Mechanism | PKS_II_8 | daunorubicin | BGC0000217
Polyketides | Type III Mechanism | PKS_III_1 | flaviolin | BGC0000285
Polyketides | Type III Mechanism | PKS_III_1 | Flaviolin rhamnoside | BGC0000285
Polyketides | Type III Mechanism | PKS_III_1 | 3,3'-diflaviolin | BGC0000285
Polyketides | Other Mechanism | PKS_X_1 | pamamycin | BGC0001150
Polyketides | Other Mechanism | PKS_X_1 | nonactin | BGC0000252
Polyketides | Other Mechanism | PKS_X_1 | macrotetrolide | BGC0000243, BGC0000244
Polyketides/NRPs |  | NRP_PKS_I | rapamycin | BGC0001040
Polyketides/NRPs |  | NRP_PKS_I | meridamycin | BGC0001011, BGC0001012, BGC0001013
Polyketides/NRPs |  | NRP_PKS_I | FK506 | BGC0000353
Polyketides/NRPs |  | NRP_PKS_I | FK520 | BGC0000994
Polyketides/NRPs |  | NRPS_PKS_II | luminmycin | BGC0000383
Polyketides/NRPs |  | NRPS_PKS_II | glidobactin | BGC0000997
Polyketides/NRPs |  | NRPS_PKS_III | caerulomycin A | BGC0000966
Polyketides/NRPs |  | NRPS_PKS_III | collismycin A | BGC0000973
Polyketides/NRPs |  | NRPS_PKS_IV | burkholdac A | BGC0000964
Polyketides/NRPs |  | NRPS_PKS_IV | FK228 | BGC0000993
Polyketides/NRPs |  | NRPS_PKS_IV | spiruchostatin A | BGC0001045
Polyketides/NRPs |  | NRPS_PKS_V | zearalenone | BGC0001057
Polyketides/NRPs |  | NRPS_PKS_V | lasiodiplodin | BGC0001245
Polyketides/NRPs |  | NRPS_PKS_V | trans-resorcylide | BGC0001246
Polyketides/NRPs |  | NRPS_PKS_V | dehydrocurvularin | BGC0000045
Polyketides/NRPs |  | NRPS_PKS_V | hypothemycin | BGC0000076, BGC0000077
Polyketides/NRPs |  | NRPS_PKS_V | radicicol | BGC0000134
NRPs |  | Glycopeptides |  | BGC0000455
NRPs |  | Glycopeptides | balhimycin | BGC0000311
NRPs |  | Glycopeptides | A40926 | BGC0000289
NRPs |  | Glycopeptides | ristocetin | BGC0000418
NRPs |  | Glycopeptides | ristomycin A | BGC0000419
NRPs |  | Glycopeptides |  | BGC0000440, BGC0000441
NRPs |  | Glycopeptides | A47934 | BGC0000290
NRPs |  | Glycopeptides | complestatin | BGC0000326
NRPs |  | Glycopeptides | chloroeremomycin | BGC0000322
NRPs |  | Glycopeptides | uk-68,597 | BGC0001178
NRPs |  | Pseudomonas lipopeptides I | WLIP | BGC0000462
NRPs |  | Pseudomonas lipopeptides I | massetolide A | BGC0000389
NRPs |  | Pseudomonas lipopeptides I | orfamide A | BGC0000399
NRPs |  | Pseudomonas lipopeptides I | orfamide B | BGC0000398, BGC0000399
NRPs |  | Pseudomonas lipopeptides I | orfamide C | BGC0000399
NRPs |  | Pseudomonas lipopeptides I | poaeamide A | BGC0001208
NRPs |  | Pseudomonas lipopeptides I | poaeamide B | BGC0001347
NRPs |  | Pseudomonas lipopeptides II | xantholysin A | BGC0000463
NRPs |  | Pseudomonas lipopeptides II | xantholysin B | BGC0000463
NRPs |  | Pseudomonas lipopeptides II | xantholysin C | BGC0000463
NRPs |  | Pseudomonas lipopeptides II | entolysin | BGC0000344
NRPs |  | Pseudomonas lipopeptides II | putisolvin | BGC0000411
NRPs |  | Pseudomonas lipopeptides III | cichofactin A | BGC0000323
NRPs |  | Pseudomonas lipopeptides III | cichofactin B | BGC0000323
NRPs |  | Pseudomonas lipopeptides III | syringafactin | BGC0000435
NRPs |  | Pseudomonas lipopeptides IV | sessilin A | BGC0000425
NRPs |  | Pseudomonas lipopeptides IV | tolaasin I | BGC0000447
NRPs |  | Pseudomonas lipopeptides IV | tolaasin F | BGC0000447
NRPs |  | Streptomyces lipopeptides | CDA1b | BGC0000315
NRPs |  | Streptomyces lipopeptides | CDA2a | BGC0000315
NRPs |  | Streptomyces lipopeptides | CDA2b | BGC0000315
NRPs |  | Streptomyces lipopeptides | CDA3a | BGC0000315
NRPs |  | Streptomyces lipopeptides | CDA3b | BGC0000315
NRPs |  | Streptomyces lipopeptides | CDA4a | BGC0000315
NRPs |  | Streptomyces lipopeptides | CDA4b | BGC0000315
NRPs |  | Streptomyces lipopeptides | taromycin A | BGC0000439
NRPs |  | Streptomyces lipopeptides |  | BGC0000336
NRPs |  | Pyrrolobenzodiazepines | sibiromycin | BGC0000428
NRPs |  | Pyrrolobenzodiazepines | anthramycin | BGC0000303
NRPs |  | Pyrrolobenzodiazepines | tomaymycin | BGC0000448
NRPs |  | Pyrrolobenzodiazepines | porothramycin | BGC0000409
NRPs |  | NRP siderophores | enterobactin | BGC0000343
NRPs |  | NRP siderophores | vanchrobactin | BGC0000454
NRPs |  | NRP siderophores | vibriobactin | BGC0000945
NRPs |  | NRP siderophores | bacillibactin | BGC0000309, BGC0001185
NRPs |  | NRP siderophores | mirubactin | BGC0000392
NRPs |  | NRP siderophores | vulnibactin | BGC0000460
NRPs |  | NRP siderophores | fimsbactin A | BGC0000352
NRPs |  | NRP siderophores | acinetobactin | BGC0000294
NRPs |  | Hydroxamate-type siderophores | coelichelin | BGC0000325
NRPs |  | Hydroxamate-type siderophores | scabichelin | BGC0000423
NRPs |  | Hydroxamate-type siderophores | heterobactin A | BGC0000371
NRPs |  | Hydroxamate-type siderophores | heterobactin S2 | BGC0000371
NRPs |  | Hydroxamate-type siderophores | erythrochelin | BGC0000349
NRPs |  | Tuberculostatic | viomycin | BGC0000458
NRPs |  | Tuberculostatic antibiotics | capreomycin | BGC0000316
NRPs |  | Lincosamides | celesticetin | BGC0001225
NRPs |  | Lincosamides | lincomycin | BGC0000907
NRPs |  | Cyclododecapeptides | valinomycin | BGC0000453
NRPs |  | Cyclododecapeptides | cereulide | BGC0000320
NRPs |  | Cyanobacterial peptins | cyanopeptin | BGC0000331
NRPs |  | Cyanobacterial peptins | cyanopeptolin | BGC0000332
NRPs |  | Cyanobacterial peptins | micropeptin k139 | BGC0001018
NRPs |  | Bleomycin-like antitumor antibiotics | zorbamycin | BGC0001058
NRPs |  | Bleomycin-like antitumor antibiotics | bleomycin | BGC0000963
NRPs |  | Bleomycin-like antitumor antibiotics | tallysomycin A | BGC0001048
NRPs |  | Quinomycin antibiotics | triostin A | BGC0000450
NRPs |  | Quinomycin antibiotics | SW-163C | BGC0000434
NRPs |  | Quinomycin antibiotics | SW-163D | BGC0000434
NRPs |  | Quinomycin antibiotics | SW-163E | BGC0000434
NRPs |  | Quinomycin antibiotics | SW-163F | BGC0000434
NRPs |  | Quinomycin antibiotics | SW-163G | BGC0000434
NRPs |  | Quinomycin antibiotics | thiocoraline | BGC0000445
NRPs |  | Quinomycin antibiotics | quinomycin | BGC0000415
NRPs |  | Quinomycin antibiotics | echinomycin | BGC0000339
NRPs |  | Quinomycin antibiotics | retimycin A | BGC0001228
NRPs |  | Capuramycin-type antibiotics | A-500359 A | BGC0000949
NRPs |  | Capuramycin-type antibiotics | A-500359 B | BGC0000949
NRPs |  | Capuramycin-type antibiotics | A-503083 A | BGC0000288
NRPs |  | Capuramycin-type antibiotics | A-503083 B | BGC0000288
NRPs |  | Capuramycin-type antibiotics | A-503083 F | BGC0000288
NRPs |  | Capuramycin-type antibiotics | A-503083 E | BGC0000288
NRPs |  | Myxobacterial thiazole polyketides | myxothiazol | BGC0001024
NRPs |  | Myxobacterial thiazole polyketides | melithiazol | BGC0001009
NRPs |  | Myxobacterial thiazole polyketides | cystothiazole A | BGC0000982
NRPs |  | Cyclic diamidines | congocidine | BGC0000327, BGC0001147
NRPs |  | Cyclic diamidines | distamycin | BGC0001147
NRPs |  | Cyclic diamidines | disgocidine | BGC0001147
NRPs |  | Indolactams | pendolmycin | BGC0000391
NRPs |  | Indolactams | methylpendolmycin | BGC0000391
NRPs |  | Indolactams | lyngbyatoxin | BGC0000384
NRPs |  | Indolactams | teleocidin B | BGC0001085
NRPs |  | Pyrrothine antibiotics | thiolutin | BGC0001193
NRPs |  | Pyrrothine antibiotics | holomycin | BGC0000373
NRPs |  | Uridyl peptide antibiotics | pacidamycin 1 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin 2 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin 3 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin 4 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin 5 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin 6 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin 7 | BGC0000951
NRPs |  | Uridyl peptide antibiotics | pacidamycin D | BGC0000951
NRPs |  | Uridyl peptide antibiotics | napsamycin A | BGC0000950
NRPs |  | Uridyl peptide antibiotics | napsamycin B | BGC0000950
NRPs |  | Uridyl peptide antibiotics | napsamycin C | BGC0000950
NRPs |  | Uridyl peptide antibiotics | mureidomycin A | BGC0000950
NRPs |  | Uridyl peptide antibiotics | mureidomycin B | BGC0000950
NRPs |  | Beta-lactam antibiotics | C | BGC0000317
NRPs |  | Beta-lactam antibiotics |  | BGC0000404
NRPs |  | Beta-lactam antibiotics | cephamycin C | BGC0000319
NRPs |  | Rhabdopeptides | xenortide A | BGC0000465
NRPs |  | Rhabdopeptides | xenortide B | BGC0000465
NRPs |  | Rhabdopeptides | xenortide C | BGC0000465
NRPs |  | Rhabdopeptides | xenortide D | BGC0000465
NRPs |  | Rhabdopeptides | rhabdopeptide 1 | BGC0000416
NRPs |  | Rhabdopeptides | rhabdopeptide 2 | BGC0000416
NRPs |  | Rhabdopeptides | rhabdopeptide 3 | BGC0000416
NRPs |  | Rhabdopeptides | rhabdopeptide 4 | BGC0000416
NRPs |  | Fumitremorgin-type alkaloids | fumitremorgin B | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | fumitremorgin C | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | verruculogen | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | tryprostatin A | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | tryprostatin B | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | demethoxyfumitremorgin C | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | brevianamide F | BGC0000356
NRPs |  | Fumitremorgin-type alkaloids | notoamide A | BGC0001084, BGC0000818
NRPs |  | Fumitremorgin-type alkaloids | paxilline | BGC0001082
NRPs |  | Fumitremorgin-type alkaloids | paspaline | BGC0001082
NRPs |  | Fumitremorgin-type alkaloids | 13-desoxypaxilline | BGC0001082
NRPs |  | Fumitremorgin-type alkaloids | paspaline B | BGC0001082
NRPs |  | Fumitremorgin-type alkaloids | terpendole E | BGC0001260
Alkaloids |  | Indolocarbazoles | rebeccamycin | BGC0000821, BGC0000822, BGC0000823
Alkaloids |  | Indolocarbazoles | staurosporine | BGC0000825, BGC0000826, BGC0000827
Alkaloids |  | Indolocarbazoles | K-252a | BGC0000813, BGC0000814
Alkaloids |  | Indolocarbazoles | AT2433 | BGC0000809
RiPPs |  | Thiopeptides I | thiomuracin | BGC0000613
RiPPs |  | Thiopeptides I | GE2270 | BGC0001155
RiPPs |  | Thiopeptides I | GE2270A | BGC0000604
RiPPs |  | Thiopeptides I | GE37468 | BGC0000605
RiPPs |  | Thiopeptides II | siomycin | BGC0000611, BGC0000655
RiPPs |  | Thiopeptides II | thiostrepton | BGC0000614
RiPPs |  | Thiopeptides III | nocathiacin | BGC0000608, BGC0000609
RiPPs |  | Thiopeptides III | nosiheptide | BGC0000610
RiPPs |  | Thiopeptides IV | lactocillin | BGC0000628
RiPPs |  | Thiopeptides IV | thiocillin I | BGC0000612
RiPPs |  | Lanthipeptides_I | nisin Z | BGC0000538
RiPPs |  | Lanthipeptides_I | salivaricin 9 | BGC0000547
RiPPs |  | Lanthipeptides_I | salivaricin A | BGC0000548
RiPPs |  | Lanthipeptides_I | salivaricin D | BGC0000549
RiPPs |  | Lanthipeptides_I | salivaricin G32 | BGC0000550
RiPPs |  | Lanthipeptides_I | salivaricin CRL1328 alpha peptide | BGC0000624
RiPPs |  | Lanthipeptides_I | salivaricin CRL1328 beta peptide | BGC0000624
RiPPs |  | Lanthipeptides II | gallidermin | BGC0000514
RiPPs |  | Lanthipeptides II | epidermin | BGC0000508
RiPPs |  | Lanthipeptides III | entianin | BGC0000506
RiPPs |  | Lanthipeptides III | subtilin | BGC0000559
RiPPs |  | Lanthipeptides III | ericin S | BGC0000511
RiPPs |  | Lanthipeptides IV | microbisporicin A2 | BGC0000529
RiPPs |  | Lanthipeptides IV | planosporicin | BGC0000544
RiPPs |  | Linaridins | cypemycin | BGC0000582
RiPPs |  | Linaridins | grisemycin | BGC0000583
RiPPs |  | Cyanobactins | patellin 2 | BGC0000477
RiPPs |  | Cyanobactins | patellin 3 | BGC0000477
RiPPs |  | Cyanobactins | patellin 6 | BGC0000478
RiPPs |  | Cyanobactins | trunkamide | BGC0000478
RiPPs |  | Microcins | microcin E492 | BGC0000586
RiPPs |  | Microcins | microcin M | BGC0000589
RiPPs |  | Microcins | microcin H47 | BGC0000587
RiPPs |  | Microviridins | microviridin K | BGC0000594
RiPPs |  | Microviridins | microviridin B | BGC0000592
RiPPs |  | Microviridins | microviridin J | BGC0000593
Terpenes |  | Terpenes_1 | astaxanthin | BGC0000630
Terpenes |  | Terpenes_1 | zeaxanthin | BGC0000656
Terpenes |  | Terpenes_2 | lolitrem | BGC0001081
Terpenes |  | Terpenes_2 | aflatrem | BGC0000629
Terpenes |  | Terpenes_4 | (-)-delta-cadinene | BGC0000674
Terpenes |  | Terpenes_4 | (+)-T-muurolol | BGC0000675
Terpenes |  | Terpenes_5 | nivalenol | BGC0001278
Terpenes |  | Terpenes_5 | deoxynivalenol | BGC0001278
Terpenes |  | Terpenes_5 | 3-acetyldeoxynivalenol | BGC0001278
Terpenes |  | Terpenes_5 | 15-acetyldeoxynivalenol | BGC0001278
Terpenes |  | Terpenes_5 | neosolaniol | BGC0001278
Terpenes |  | Terpenes_5 | calonectrin | BGC0001278
Terpenes |  | Terpenes_5 | apotrichodiol | BGC0001278
Terpenes |  | Terpenes_5 | isotrichotriol | BGC0001278
Terpenes |  | Terpenes_5 | 15-decalonectrin | BGC0001278
Terpenes |  | Terpenes_5 | T-2 Toxin | BGC0001278
Terpenes |  | Terpenes_5 | 3-acetyl T-2 toxin | BGC0001278
Terpenes |  | Terpenes_5 | trichodiene | BGC0001277, BGC0001278
Terpenes |  | Terpenes_5 | trichothecene | BGC0000930, BGC0000931
Saccharides |  | Aminoglycosides I | kanamycin | BGC0000702, BGC0000703, BGC0000704, BGC0000705, BGC0000706
Saccharides |  | Aminoglycosides I | tobramycin | BGC0000719, BGC0000720, BGC0000721
Saccharides |  | Aminoglycosides I | neomycin | BGC0000709, BGC0000710, BGC0000711
Saccharides |  | Aminoglycosides I | lividomycin | BGC0000708
Saccharides |  | Aminoglycosides I | paromomycin | BGC0000712
Saccharides |  | Aminoglycosides I | ribostamycin | BGC0000713
Saccharides |  | Aminoglycosides II | sisomicin | BGC0000714
Saccharides |  | Aminoglycosides II | gentamicin | BGC0000696, BGC0000697
Saccharides |  | Aminoglycosides III | streptomycin | BGC0000717, BGC0000718, BGC0000724
Saccharides |  | Aminoglycosides III | 5'-hydroxystreptomycin | BGC0000690
Other Hybrids |  | Terpenes_Polyketide_X | napyradiomycin | BGC0000652
Other Hybrids |  | Terpenes_Polyketide_X | merochlorin A | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | merochlorin B | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | merochlorin C | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | merochlorin D | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | deschloro-merochlorin A | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | deschloro-merochlorin B | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | isochloro-merochlorin B | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | dichloro-merochlorin B | BGC0001083
Other Hybrids |  | Terpenes_Polyketide_X | furaquinocin A | BGC0001078
Others |  | Aminocoumarins | coumermycin A1 | BGC0000833
Others |  | Aminocoumarins | clorobiocin | BGC0000832
Others |  | Aminocoumarins | novobiocin | BGC0000834
Others |  | Aryl Polyenes | flexirubin | BGC0000838, BGC0000839
Others |  | Aryl Polyenes | xanthomonadin | BGC0000840
Others |  | Aryl Polyenes | APE Ec | BGC0000836
Others |  | Aryl Polyenes | APE Vf | BGC0000837
Others |  | Others_X | blasticidin | BGC0000874
Others |  | Others_X | arginomycin | BGC0000883
Others |  | Others_X | mildiomycin | BGC0000882
Others |  | Nucleosides | caprazamycin | BGC0000875
Others |  | Nucleosides | Caprazamycin aglycon | BGC0000875
Others |  | Nucleosides | A-90289 A | BGC0000872
Others |  | Nucleosides | A-90289 B | BGC0000872
Others |  | Nucleosides | liposidomycin A | BGC0001076
Others |  | Nucleosides | liposidomycin B | BGC0001076
Others |  | Nucleosides | liposidomycin C | BGC0001076
Others |  | Nucleosides | liposidomycin G | BGC0001076
Others |  | Nucleosides | liposidomycin H | BGC0001076
Others |  | Nucleosides | liposidomycin K | BGC0001076
Others |  | Nucleosides | liposidomycin L | BGC0001076
Others |  | Nucleosides | liposidomycin M | BGC0001076
Others |  | Nucleosides | liposidomycin N | BGC0001076
Others |  | Nucleosides | liposidomycin Z | BGC0001076

Text S2. BiG-SCAPE classes weight optimization

Weight optimization for BiG-SCAPE was performed as described in the Online Methods with the following characteristics.

Within DSS, two sets of “anchor domains” were used: an initial set of four domains with Condensation domain (PF00668); Beta-ketoacyl synthase, N-terminal domain (PF00109); Beta-ketoacyl synthase, C-terminal domain (PF02801); and Terpene synthase, N-terminal domain (PF01397); and an extended set of eight domains that also included AMP-binding enzyme (PF00501); Lanthionine synthase C-like protein (PF05147); Chalcone and stilbene synthases, N-terminal domain (PF00195); and Chalcone and stilbene synthases, C-terminal domain (PF02797).

Base correlation: uses the manually assigned default BiG-SCAPE weights Jw = 0.2, DDSw = 0.75, GKw = 0.05, AIw = 0.0, with anchorboost = 2.0. A value of 1 for anchorboost means no change in the perceived proportion of anchor domains.

Best correlation: the ranges used for the optimization are Jw, DDSw, GKw, AIw ∈ [0, 1], subject to the condition Jw + DDSw + GKw + AIw = 1, and anchorboost ∈ [1, 4]. The optimization step is 0.01 for the weights and 0.5 for anchorboost. Best weights are given in the format Jw, DDSw, GKw, AIw. Numbers in parentheses are the P-values calculated by the Pearson function from Python.
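The search space described above can be enumerated with a short sketch. This is an illustration only, not the authors' optimization code: the function names `weight_grid` and `anchorboost_grid` are hypothetical, and integer steps are used so that the four weights sum to 1 without floating-point drift.

```python
def weight_grid(steps=100):
    """Yield (Jw, DDSw, GKw, AIw) on a regular grid with Jw + DDSw + GKw + AIw = 1.

    steps=100 corresponds to the optimization step of 0.01 stated in the text.
    """
    for j in range(steps + 1):
        for d in range(steps + 1 - j):
            for g in range(steps + 1 - j - d):
                a = steps - j - d - g  # the last weight is fixed by the constraint
                yield (j / steps, d / steps, g / steps, a / steps)


def anchorboost_grid(lo=1.0, hi=4.0, step=0.5):
    """Return the anchorboost values in [lo, hi] at the given step."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]


if __name__ == "__main__":
    n_weights = sum(1 for _ in weight_grid(100))
    n_boosts = len(anchorboost_grid())
    print(n_weights, n_boosts, n_weights * n_boosts)
```

Each candidate combination would then be scored by its Pearson correlation on the data points of one class, and the best-scoring combination reported as "Best weights".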

Intra groups: each data point is a pair of nodes in which both nodes belong to the same group, for each group within the class of the Curated Compound Groups table. Inter groups: each data point is a pair of nodes in which both nodes belong to the same class (but may belong to different groups).

Additionally, only data points with at least two predicted domains were considered. Selected weights are highlighted in bold. See also Figure S1.
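The intra-group and inter-group data points defined above can be built as in the following sketch. The input mapping and the identifiers in it are hypothetical placeholders; the filter on the number of predicted domains is left out because it depends on annotation data not shown here.

```python
from itertools import combinations


def make_pairs(bgcs):
    """Split BGC pairs into intra-group and inter-group data points.

    bgcs maps a BGC identifier to a (curated_class, curated_group) tuple.
    Inter-group pairs share a class (groups may differ); intra-group pairs
    additionally share the group.
    """
    intra, inter = [], []
    for a, b in combinations(sorted(bgcs), 2):
        class_a, group_a = bgcs[a]
        class_b, group_b = bgcs[b]
        if class_a != class_b:
            continue  # pairs from different classes are never compared
        inter.append((a, b))
        if group_a == group_b:
            intra.append((a, b))
    return intra, inter


if __name__ == "__main__":
    # Placeholder identifiers, not real MIBiG accessions.
    example = {
        "bgc_1": ("Others", "Aminocoumarins"),
        "bgc_2": ("Others", "Aminocoumarins"),
        "bgc_3": ("Others", "Nucleosides"),
        "bgc_4": ("Saccharides", "Aminoglycosides II"),
    }
    print(make_pairs(example))
```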

BiG-SCAPE Classes / Curated Compound classes:
• PKS type I (Family: Polyketides, Subfamily: Type I Mechanism)
• PKS Other Types (Family: Polyketides, Subfamily: Type II Mechanism, Type III Mechanism, Other Mechanism)
• NRPs
• RiPPs
• Polyketides/NRP hybrids
• Saccharides
• All other families (Alkaloids, Terpenes, Other Hybrids, Others)


Polyketides Type I Mechanism

Intra groups (188 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.2647 (0.0002) | 0.2660 (0.0002) |
| Best correlation | 0.3056 (1.99e-05) | 0.3056 (1.99e-05) |
| Best weights | 0.63, 0.31, 0.06, 0.0, anchorboost: 1.0 | 0.63, 0.31, 0.06, 0.0, anchorboost: 1.0 |

Inter groups (6027 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.4676 (0.0) | 0.4698 (0.0) |
| Best correlation | 0.4863 (0.0) | 0.4863 (0.0) |
| Best weights | 0.22, 0.76, 0.02, 0.0, anchorboost: 1.0 | 0.22, 0.76, 0.02, 0.0, anchorboost: 1.0 |

Polyketides Type II Mechanism, Type III Mechanism, Other Mechanism

Intra groups (17 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.2623 (0.3089) | 0.2495 (0.3340) |
| Best correlation | 0.4552 (0.0663) | 0.4552 (0.0663) |
| Best weights | 0.0, 0.0, 0.0, 1.0, anchorboost: 1.0 | 0.0, 0.0, 0.0, 1.0, anchorboost: 1.0 |

Inter groups (242 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.7294 (1.81e-41) | 0.7296 (1.73e-41) |
| Best correlation | 0.7659 (5.97e-48) | 0.7674 (3.14e-48) |
| Best weights | 0.0, 0.32, 0.0, 0.68, anchorboost: 4.0 | 0.0, 0.33, 0.0, 0.67, anchorboost: 4.0 |

NRPs

Intra groups (286 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.6074 (3.08e-30) | 0.6110 (1.16e-30) |
| Best correlation | 0.6606 (3.02e-37) | 0.6556 (1.59e-36) |
| Best weights | 0.0, 1.0, 0.0, 0.0, anchorboost: 4.0 | 0.0, 1.0, 0.0, 0.0, anchorboost: 4.0 |

Inter groups (6760 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.7469 (0.0) | 0.7473 (0.0) |
| Best correlation | 0.7714 (0.0) | 0.7678 (0.0) |
| Best weights | 0.0, 1.0, 0.0, 0.0, anchorboost: 4.0 | 0.01, 0.98, 0.0, 0.01, anchorboost: 3.5 |

RiPPs

Intra groups (16 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.7812 (0.0003) | 0.7832 (0.0003) |
| Best correlation | 0.8846 (5.34e-06) | 0.8845 (5.37e-06) |
| Best weights | 0.04, 0.43, 0.53, 0.0, anchorboost: 4.0 | 0.04, 0.43, 0.53, 0.0, anchorboost: 1.5 |

Inter groups (157 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.8696 (2.17e-49) | 0.8726 (4.02e-50) |
| Best correlation | 0.8867 (8.36e-54) | 0.8906 (6.29e-55) |
| Best weights | 0.28, 0.71, 0.0, 0.01, anchorboost: 1.0 | 0.23, 0.74, 0.0, 0.03, anchorboost: 4.0 |

PKS/NRPs hybrids

Intra groups (37 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | -0.0209 (0.9019) | 0.0108 (0.9490) |
| Best correlation | 0.3507 (0.0332) | 0.3507 (0.0332) |
| Best weights | 0.0, 0.0, 0.33, 0.67, anchorboost: 1.0 | 0.0, 0.0, 0.33, 0.67, anchorboost: 1.0 |

Inter groups (186 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.7165 (1.33e-30) | 0.7192 (6.40e-31) |
| Best correlation | 0.7418 (9.26e-34) | 0.7418 (9.26e-34) |
| Best weights | 0.0, 0.78, 0.06, 0.16, anchorboost: 1.0 | 0.0, 0.78, 0.06, 0.16, anchorboost: 1.0 |

Chosen weights: 0.0, 0.78, 0.0, 0.22, anchorboost: 1.0. These weights were chosen because the Goodman-Kruskal index was dropped in the final version of BiG-SCAPE.

Saccharides

Intra groups (80 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.3860 (0.0004) | 0.3841 (0.0004) |
| Best correlation | 0.4848 (5.16e-06) | 0.4848 (5.16e-06) |
| Best weights | 0.0, 0.0, 0.21, 0.79, anchorboost: 1.0 | 0.0, 0.0, 0.21, 0.79, anchorboost: 1.0 |

Inter groups (186 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.5703 (9.57e-17) | 0.5689 (1.16e-16) |
| Best correlation | 0.6390 (8.05e-22) | 0.6390 (8.05e-22) |
| Best weights | 0.0, 0.0, 0.17, 0.83, anchorboost: 1.0 | 0.0, 0.0, 0.17, 0.83, anchorboost: 1.0 |

Chosen weights: 0.0, 0.0, 0.0, 1.0, anchorboost: 1.0. These weights were chosen because the Goodman-Kruskal index was dropped in the final version of BiG-SCAPE.

All other groups (Alkaloids, Terpenes, Other Hybrids, Others)

Intra groups (262 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | -0.1182 (0.0559) | -0.1163 (0.0599) |
| Best correlation | 0.0751 (0.2252) | 0.0751 (0.2252) |
| Best weights | 0.0, 0.0, 1.0, 0.0, anchorboost: 1.0 | 0.0, 0.0, 1.0, 0.0, anchorboost: 1.0 |

Inter groups (774 points)

| | Anchor domains: 4 | Anchor domains: 8 |
|---|---|---|
| Base correlation | 0.5355 (1.17e-58) | 0.5367 (5.90e-59) |
| Best correlation | 0.5363 (7.07e-59) | 0.5392 (1.32e-59) |
| Best weights | 0.37, 0.57, 0.06, 0.0, anchorboost: 4.0 | 0.01, 0.97, 0.02, 0.0, anchorboost: 4.0 |

Due to the lack of BGCs related to the curated Terpene class, the base set of weights was chosen for this BiG-SCAPE class: 0.2, 0.75, 0.0, 0.05, anchorboost: 2.0
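The chosen weights reported for the PKS/NRPs hybrids, Saccharides and "all other" classes are numerically consistent with setting GKw to zero and adding its value to AIw. The source only states that the Goodman-Kruskal index was dropped, so the rule below is one reading of the numbers rather than a documented procedure, and `fold_gk_into_ai` is a hypothetical helper name.

```python
def fold_gk_into_ai(weights):
    """Return (Jw, DDSw, GKw, AIw) with GKw set to 0 and its value added to AIw."""
    jw, ddsw, gkw, aiw = weights
    # Rounding to two decimals matches the 0.01 optimization step.
    return (jw, ddsw, 0.0, round(aiw + gkw, 2))


if __name__ == "__main__":
    print(fold_gk_into_ai((0.0, 0.78, 0.06, 0.16)))  # PKS/NRPs hybrids, inter groups
    print(fold_gk_into_ai((0.0, 0.0, 0.17, 0.83)))   # Saccharides, inter groups
    print(fold_gk_into_ai((0.2, 0.75, 0.05, 0.0)))   # base weights
```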

Figure S1. Scatterplots with weight optimization results. Filled colored points represent pairs of BGCs within the same curated Compound Group from Table S1, while hollow points represent pairs of BGCs belonging to different groups but the same Curated Compound Class.


Text S3. MIBiG network cutoff analysis

We applied a targeted attack to the networks produced by running BiG-SCAPE on the MIBiG database to identify the most suitable cutoff value at which to filter the network before proceeding, by default, to clustering analysis at the gene cluster clan/family level. The targeted attack removes the edges above a certain cutoff value. For each iteration, we calculated the number of nodes and the graph density, and identified the connected components. Isolated vertices (BGCs) were removed for the component identification. We evaluated the resulting networks in terms of the number of vertices/edges lost and the size of the connected components that emerged during the attack. A good threshold is one that reduces the number of edges while maximizing the number of vertices retained and minimizing the impact on the structural integrity of the network. Figure S2 shows the dynamics and impact of the different filtering thresholds applied to the different BGC training networks; a cutoff of 0.75 is the value that globally satisfies the conditions described above. All analyses were performed in R (Team & Others, 2013) using the igraph package (Csardi & Nepusz, 2006) for the network analyses and ggplot2 (Wickham, Chang, & Others, 2008) for plotting.
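One iteration of such a targeted attack can be sketched as follows. The original analysis was done in R with igraph; this standard-library Python version is only an illustration. It assumes that edges carry a BiG-SCAPE distance, that each pair of nodes appears at most once, and that density is computed over the full node set; none of these details is specified in the text.

```python
def attack(nodes, edges, cutoff):
    """Remove edges with distance above cutoff and summarize the remaining network.

    nodes: iterable of node identifiers.
    edges: list of (u, v, distance) tuples, one per node pair.
    """
    kept = [(u, v) for u, v, dist in edges if dist <= cutoff]

    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in kept:
        parent[find(u)] = find(v)

    # Isolated vertices are excluded from component identification.
    connected = {x for edge in kept for x in edge}
    components = {}
    for x in connected:
        components.setdefault(find(x), set()).add(x)

    n = len(set(nodes))
    density = 2 * len(kept) / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": len(connected),
        "edges": len(kept),
        "density": density,
        "component_sizes": sorted((len(c) for c in components.values()), reverse=True),
    }


if __name__ == "__main__":
    toy_edges = [("a", "b", 0.2), ("b", "c", 0.5), ("c", "d", 0.9)]
    for cutoff in (0.3, 0.75, 1.0):
        print(cutoff, attack("abcd", toy_edges, cutoff))
```

Sweeping the cutoff over a range of values and plotting the returned quantities gives curves of the kind summarized in Figure S2.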

Figure S2. Targeted attack of the MIBiG network

Targeted attack on the different training networks. The top panel shows the proportion of nodes and edges present in the resulting network after applying the different thresholds. The bottom panel shows the proportion of components with more members than the defined component size. Vertical lines mark the selected threshold (0.75).

Text S4. Clustering analysis on the MIBiG network

We used the filtered (cutoff = 0.75) MIBiG networks from the previous section to identify the method that best identifies clusters of vertices that might constitute potential BGC families. To evaluate the clustering results, we calculated the entropy of each cluster based on the compounds that each BGC (vertex) produces. A good clustering method will have low entropy while maximizing the size of the cluster. Figures S3 and S4 show the results of applying the different clustering methods to the different training networks (glocal and global). The Affinity Propagation clustering method showed the most sensible results, producing clusters with low entropy and average size. All the other methods tested (Table S2) resulted in clusters present in the principal quadrant, indicating that these methods were not able to partition the data properly and lumped together vertices (large size) that encode different types of compounds (large entropy). Based on these results, Affinity Propagation was chosen as the clustering algorithm in BiG-SCAPE.
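The per-cluster entropy used in this evaluation can be sketched as below. The text does not give the exact definition, so this assumes Shannon entropy (base 2) over the compound labels of the BGCs in a cluster; the labels in the example are placeholders.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (bits) of the compound labels within one cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    # The trailing + 0.0 turns a negative zero into a plain 0.0.
    return -sum((n / total) * math.log2(n / total) for n in counts.values()) + 0.0


if __name__ == "__main__":
    # A homogeneous cluster has zero entropy; a mixed one has higher entropy.
    print(cluster_entropy(["group_a", "group_a", "group_a"]))
    print(cluster_entropy(["group_a", "group_b", "group_c", "group_d"]))
```

Under this definition, a large cluster that mixes several compound groups has both a large size and a high entropy, which is the pattern the text describes for the methods other than Affinity Propagation.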

Figure S3. Clustering analysis on the MIBiG network using glocal mode

Overview of the clustering results for the glocal aligned training networks. Each facet shows the relationship between the entropy and the size of the clusters for each clustering method and training network. Grey horizontal and vertical lines are the intersection of the maximum values for the Affinity Propagation (highlighted in red) clustering method for each training network. A good clustering result should be restricted to the third quadrant. AP: affinity propagation; FG: fast greedy; IM: infomap; LV: Louvain; LP: label propagation; MCL: Markov clustering; TOM: topological overlap matrix; WT: walktrap.

Figure S4. Clustering analysis on the MIBiG network using global mode

Overview of the clustering results for the global aligned training networks. Each facet shows the relationship between the entropy and the size of the clusters for each clustering method and training network. Grey horizontal and vertical lines are the intersection of the maximum values for the Affinity Propagation (highlighted in red) clustering method for each training network. A good clustering result should be restricted to the third quadrant. AP: affinity propagation; FG: fast greedy; IM: infomap; LV: Louvain; LP: label propagation; MCL: Markov clustering; TOM: topological overlap matrix; WT: walktrap.

Table S2. Clustering methods used to identify potential BGC families in the training networks.

| Clustering method | Reference | Notes |
|---|---|---|
| Affinity propagation (AP) | (Frey & Dueck, 2007) | scikit-learn (Pedregosa et al., 2011) |
| Fast greedy (FG) | (Clauset, Newman, & Moore, 2004) | igraph |
| Label propagation (LP) | (Raghavan, Albert, & Kumara, 2007) | igraph |
| Walktrap (WT) | (Pons & Latapy, 2005) | igraph |
| Infomap (IM) | (Martin Rosvall & Bergstrom, 2008; M. Rosvall, Axelsson, & Bergstrom, 2009) | igraph and original implementation |
| Louvain community (LV) | (Blondel, Guillaume, & Lambiotte, 2008) | igraph and original implementation |
| Markov Clustering (MCL) | (Van Dongen, 2000) | original implementation |
| Topological Overlap Matrix (TOM) | (Zhang & Horvath, 2005) | WGCNA package (Langfelder & Horvath, 2008) |

Figure S5. Alluvial plot depicting BiG-SCAPE’s assignation of MIBiG BGCs to Gene Cluster Families against the manually curated groups in Table S1. BiG-SCAPE was run on the MIBiG database with hybrid mode disabled, to prevent BGCs from appearing in more than one BiG-SCAPE class, and with cutoff value c=0.75 (following the analysis in Text S3). First column: Compound Classes from the curated table; second column: Compound Groups from the curated table; third column: global mode (purity: 0.916); fourth column: glocal mode (purity: 0.888). For both modes, purity was calculated using only GCFs of size 2 and larger (57/114 for global mode; 56/114 for glocal mode) with the following formula:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right|$$

where $N$ is the number of BGCs analyzed (i.e. all from GCFs of size 2 and larger), $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of GCFs and $C = \{c_1, c_2, \ldots, c_J\}$ is the set of curated groups. In other words, the number of members of the most abundant curated group label in each GCF is summed over all GCFs and normalized by N.
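The purity calculation described in this caption (majority label count per GCF, summed and divided by the number of BGCs analyzed, using only GCFs of size 2 and larger) can be written as a short sketch. The input layout and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def purity(gcf_to_labels, min_size=2):
    """Purity of a GCF assignment against curated group labels.

    gcf_to_labels maps a GCF identifier to the list of curated group labels
    of its member BGCs. GCFs smaller than min_size are ignored.
    """
    kept = [labels for labels in gcf_to_labels.values() if len(labels) >= min_size]
    n = sum(len(labels) for labels in kept)
    if n == 0:
        raise ValueError("no GCF reaches the minimum size")
    # For each GCF, count the members carrying its most abundant label.
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in kept)
    return majority / n


if __name__ == "__main__":
    toy = {
        "gcf_1": ["group_a", "group_a", "group_b"],
        "gcf_2": ["group_c", "group_c"],
        "gcf_3": ["group_d"],  # singleton, excluded from the calculation
    }
    print(purity(toy))
```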


Text S5. Actinobacteria Genomes Data Set

The original set of Actinobacteria genomes to be processed by antiSMASH was obtained on 2017-02-03 with the following query in the NCBI website (1,668 results):

("whole genome shotgun sequencing project"[title] OR "complete genome"[title]) AND (Actinobacteria[Organism] NOT (Propionibacteriales[Organism] OR Micrococcales[Organism] OR Corynebacteriales[Organism] OR Bifidobacteriales[Organism]) OR Nocardiaceae[Organism]) AND (bacteria[filter] AND biomol_genomic[PROP] AND ddbj_embl_genbank[filter]) NOT scaffold[title]

The extended set of genomes selected to be processed by antiSMASH was obtained by using the following query in the NCBI website on 2018-01-30 (2,891 results):

("whole genome shotgun sequencing project"[title] OR "complete genome"[title]) AND (Actinobacteria[Organism] NOT (Propionibacteriales[Organism] OR Micrococcales[Organism] OR Corynebacteriales[Organism] OR Bifidobacteriales[Organism]) OR Nocardiaceae[Organism]) AND (bacteria[filter] AND biomol_genomic[PROP] AND ddbj_embl_genbank[filter]) NOT (scaffold[title] OR plasmid[title] OR segment[title])

Other BiG-SCAPE information

Table S3. Compute times for BiG-SCAPE calculations for datasets of multiple sizes.

| BGCs | Time in seconds (HH:MM:SS) |
|---|---|
| 10 | 62.2 (00:01:02) |
| 100 | 102.8 (00:01:42) |
| 1000 | 897.5 (00:14:57) |
| 10000 | 15244.6 (04:14:04) |

Compute times for different numbers of randomly picked BGCs from the Expanded Actinobacteria set. Only complete BGCs were chosen (i.e. those not flagged by antiSMASH as ‘contig_edge’). BiG-SCAPE was run with default parameters (as of May 2018) using 16 threads (--c 16). Each run was calculated from scratch.

Table S4. Default Anchor Domain List used in the DSS index.

PF00668 Condensation domain [NRPS]

PF00501 AMP-binding enzyme [NRPS]

PF00109 Beta-ketoacyl synthase N-terminal [PKS]

PF02801 Beta-ketoacyl synthase C-terminal [PKS]

PF01397 Terpene synthase, N-terminal domain (Terpene_synth) [Terpene]

PF03936 Terpene synthase family, metal binding domain (Terpene_synth_C) [Terpene]

PF00195 Chalcone and stilbene synthases, N-terminal domain (Chal_sti_synt_N)

PF02797 Chalcone and stilbene synthases, C-terminal domain (Chal_sti_synt_C)

PF05147 Lanthionine synthetase C-like protein (LANC_like) [lantipeptide/RiPP]

PF00494 Squalene/phytoene synthase (SQS_PSY) [Terpene]

PF00432 Prenyltransferase and squalene oxidase repeat (Prenyltrans) [Indole alkaloids]

PF02624 YcaO cyclodehydratase, ATP-ad Mg2+-binding (YcaO) [RiPP]

Table S5. Logic for internal class separation.

| antiSMASH annotation | BiG-SCAPE class |
|---|---|
| t1pks | PKS I |
| transatpks, t2pks, t3pks, otherks, hglks | PKS other |
| nrps | NRPS |
| lantipeptide, thiopeptide, bacteriocin, linaridin, cyanobactin, glycocin, LAP, lassopeptide, sactipeptide, bottromycin, head_to_tail, microcin, microviridin, proteusin and combinations of these | RiPPs |
| amglyccycl, oligosaccharide, cf_saccharide and combinations of these | Saccharides |
| terpene | Terpene |
| {t1pks, transatpks, t2pks, t3pks, otherks, hglks} + nrps | PKS/NRPS Hybrids |
| acyl_amino_acids, arylpolyene, aminocoumarin, ectoine, butyrolactone, nucleoside, melanin, phosphoglycolipid, phenazine, phosphonate, other, cf_putative, resorcinol, indole, ladderane, PUFA, furan, hserlactone, fused, cf_fatty_acid, siderophore, blactam and any combined annotation or < mix > | Others |

Internal BGC classification used by BiG-SCAPE to separate the analysis using antiSMASH 4 annotations. If using the hybrids mode (default), BiG-SCAPE will also assign BGCs with combined annotations from antiSMASH to each individual class (e.g. a BGC annotated as “t1pks-terpene” will go to the Others, PKS I and Terpene BiG-SCAPE classes). GenBank files without antiSMASH annotations will be classified as Others.
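The class separation in Table S5 can be approximated by the following sketch. It is a simplified reading of the table, not BiG-SCAPE's actual code: the function names are hypothetical, combined annotations are assumed to be joined with "-" (as in the “t1pks-terpene” example), and combinations not covered by a specific row are sent to Others.

```python
PKS_OTHER = {"transatpks", "t2pks", "t3pks", "otherks", "hglks"}
PKS_ALL = PKS_OTHER | {"t1pks"}
RIPPS = {
    "lantipeptide", "thiopeptide", "bacteriocin", "linaridin", "cyanobactin",
    "glycocin", "LAP", "lassopeptide", "sactipeptide", "bottromycin",
    "head_to_tail", "microcin", "microviridin", "proteusin",
}
SACCHARIDES = {"amglyccycl", "oligosaccharide", "cf_saccharide"}


def single_class(annotation):
    """Map one antiSMASH annotation to its BiG-SCAPE class (Table S5)."""
    if annotation == "t1pks":
        return "PKS I"
    if annotation in PKS_OTHER:
        return "PKS other"
    if annotation == "nrps":
        return "NRPS"
    if annotation in RIPPS:
        return "RiPPs"
    if annotation in SACCHARIDES:
        return "Saccharides"
    if annotation == "terpene":
        return "Terpene"
    return "Others"


def classify(product, hybrids=True):
    """Return the set of BiG-SCAPE classes for a (possibly combined) annotation."""
    parts = [p for p in product.split("-") if p]
    if not parts:
        return {"Others"}  # no antiSMASH annotation
    part_classes = {single_class(p) for p in parts}
    if len(parts) == 1:
        return part_classes
    kinds = set(parts)
    if part_classes == {"RiPPs"} or part_classes == {"Saccharides"}:
        main = set(part_classes)  # "combinations of these" stay in the class
    elif "nrps" in kinds and kinds & PKS_ALL and kinds <= PKS_ALL | {"nrps"}:
        main = {"PKS/NRPS Hybrids"}
    else:
        main = {"Others"}
    # Hybrids mode also assigns the BGC to each individual class.
    return main | part_classes if hybrids else main


if __name__ == "__main__":
    print(sorted(classify("t1pks-terpene")))
    print(sorted(classify("t1pks-nrps", hybrids=False)))
```

With hybrids mode on, the “t1pks-terpene” example from the caption lands in Others, PKS I and Terpene.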

Figure S6. Flowchart of BiG-SCAPE components. Downstream analysis includes: SVG figures for every cluster; text “network files” that include pairwise distance; a file with annotations for every cluster analyzed (one file per BiG-SCAPE class) and Gene Cluster Family (GCF) labeling.

Figure S7. Verified clusters encoding benarthin biosynthesis. Signature gene(s) highlighted.

Figure S8. Verified clusters encoding detoxin biosynthesis. Signature gene(s) highlighted.

Figure S9. Verified clusters encoding rimosamide biosynthesis. Signature gene(s) highlighted.


Figure S10. Rimosamide-related GCFs in Fig 3a.

Fig. S10a. BGCs in the light-turquoise GCF contain another tauD domain upstream of the rimosamide BGC. Because another NRPS gene cluster lies close to the rimosamide gene clusters, antiSMASH predicts a very large ‘merged’ cluster when there is no contig break between the rimosamide cluster and the adjacent NRPS gene cluster. The light-turquoise GCF contains clusters not related to the target rimosamide BGCs because they are clustered with true rimosamide BGCs through 'linking' regions (facilitated by the glocal mode). The large clusters arise when antiSMASH finds more Core Biosynthetic Genes during the border expansion stage (in this case, more NRPS genes; light blue) and therefore predicts a very large 'merged' cluster. Moreover, even though this network was filtered to include only BGCs that contained the TauD domain, some non-rimosamide BGCs also contain a second copy of this domain and were therefore added to the network.


Fig. S10b. BGCs in the dark-turquoise GCF contain only clusters with the signature NRPS, PKS/NRPS Hybrid, tauD gene cluster architecture.

Figure S11. Phylogenomic reconstruction of 103 complete Streptomyces genomes with outgroups Catenulispora acidiphila CP001700.1 and Salinispora arenicola CP000850.1.

Figure S12. BiG-SCAPE / CORASON GCFs of the closed Streptomyces genomes.

BiG-SCAPE/CORASON GCFs (columns), sorted by frequency of appearance, dispersed across the phylogenetic reconstruction of 103 closed Streptomyces genomes. Species names are shown more legibly in the large phylogenetic tree in Fig. S11. BGCs present in fewer than 5 species are not shown in the heatplot (2036 out of 3184 total BGCs). The size of each GCF was counted by species (i.e. BGCs from the same species that were clustered in the same GCF only incremented the size of that GCF by one). The BiG-SCAPE network was calculated using the parameters c=0.3, hybrid mode: off, distance mode: glocal and --mibig (1.3). MIBiG BGCs are not shown in this analysis.

Figure S13. tauD Actinobacteria EvoMining expansions tree.

A branch containing several tauD homologues identified as part of MIBiG BGCs is shown in dark gray and includes tauD homologues from the known rimosamide and detoxin BGCs, indicated by turquoise and beige circles. The tree was generated using tauD from E. coli as the query for CORASON analysis on our Actinobacterial genome database. Colors in the tree match the BiG-SCAPE-defined GCFs shown in Fig. 5. This tree and its metadata are available for further exploration on Microreact at https://microreact.org/project/H1UuQE0qm

Table S6. Occurrences of tauD homologues in secondary metabolism BGCs as reported in MIBiG.

| No | MIBiG | Compound | Class | Producer Organism |
|---|---|---|---|---|
| 1 | BGC0000653_ADO85576 | pentalenolactone | Terpene | Streptomyces arenae |
| 2 | BGC0000678_BAC70706 | pentalenolactone | Terpene | Streptomyces avermitilis MA-4680 = NBRC 14893 |
| 3 | BGC0000163_ACR50790 | tetronasin | Polyketide | Streptomyces longisporoflavus |
| 4 | BGC0000961_ABC36162 | bactobolin | NRP / Polyketide | Burkholderia thailandensis E264 |
| 5 | BGC0000287_AAG05698 | 2-amino-4-methoxy-trans-3-butenoic acid | NRP | PAO1 |
| 6 | BGC0000846_ctg1_orf9 | tabtoxin | Other | Pseudomonas syringae |
| 7 | BGC0001183_AGC09526 | lobophorin | Polyketide | Streptomyces sp. FXJ7.023 |
| 8 | BGC0001156_ADD83004 | platencin | Terpene | Streptomyces platensis |
| 9 | BGC0001140_ACO31277 | platensimycin / platencin | Terpene | Streptomyces platensis |
| 10 | BGC0001140_ACO31282 | platensimycin / platencin | Terpene | Streptomyces platensis |
| 11 | BGC0000715_ABW87795 | spectinomycin | Saccharide | Streptomyces spectabilis |
| 12 | BGC0001205_KGO40485 | communesin | Polyketide | Penicillium expansum |
| 13 | BGC0001205_KGO40482 | communesin | Polyketide | Penicillium expansum |
| 14 | BGC0001183_AGC09525 | lobophorin | Polyketide | Streptomyces sp. FXJ7.023 |
| 15 | BGC0000654_ABB69741 | phenalinolactone | Saccharide / Terpene | Streptomyces sp. Tu6071 |
| 16 | BGC0001070_CAN89617 | kirromycin | NRP / Polyketide | Streptomyces collinus Tu 365 |

Microbiology and metabolomics methods

Text S6. Cultivation of actinomycetes for MS-based metabolomics. All strains analyzed for metabolomics were grown on four media types: arginine/glycerol/salts, mannitol/soyflour, ISP medium 4, or glycerol/sucrose/beef extract/casamino acids. After 10 days of growth, plates were frozen, then thawed and pressed to release spent liquid media. Media was then filtered and extracted using 30 mg Supel-Select HLB SPE cartridges (Supelco) and resuspended to a concentration of approximately 2 mg/mL in 5% acetonitrile prior to LC-MS analysis.

Text S7. Acquisition and analysis of LC-MS metabolomics data. All LC-MS/MS analyses were performed using an Agilent 1150s HPLC coupled with a Q-Exactive mass spectrometer (Thermo Fisher Scientific). Reversed phase chromatography was performed at a 200 µL/min flow rate on a Phenomenex Kinetex C18 RP-HPLC column (150 mm x 2.1 mm i.d., 2 µm particle size, 100 Å pore size). Mobile phase A was water with 0.1% formic acid and mobile phase B was acetonitrile with 0.1% formic acid. The mass spectrometry parameters were as follows: scan range 250-3750 m/z, resolution 35,000, scan rate ~six per second. The top five most intense ions in each full spectrum were targeted for fragmentation using a collision energy setting of 25 eV for Higher-energy Collisional Dissociation (HCD). Tandem MS/MS data was analyzed using spectral networking as previously described. Signals detected in multiple strains were determined to be the same ion if the observed accurate masses were within 4 ppm and fragmentation cosine similarity scores were above 0.75, yielding 5,824 ions detected in two or more strains.
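The two criteria used to decide that signals from different strains are the same ion (accurate mass within 4 ppm and fragmentation cosine similarity above 0.75) can be sketched as follows. The spectra are assumed here to be intensity vectors already aligned on a shared set of fragment m/z bins; how the original spectral networking workflow aligns peaks is not described in this section, and the function names are hypothetical.

```python
import math


def ppm_difference(mz_a, mz_b):
    """Mass difference in parts per million, relative to the mean of the two m/z values."""
    return abs(mz_a - mz_b) / ((mz_a + mz_b) / 2) * 1e6


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity of two intensity vectors aligned on the same bins."""
    dot = sum(a * b for a, b in zip(spec_a, spec_b))
    norm = math.sqrt(sum(a * a for a in spec_a)) * math.sqrt(sum(b * b for b in spec_b))
    return dot / norm if norm else 0.0


def same_ion(mz_a, spec_a, mz_b, spec_b, max_ppm=4.0, min_cosine=0.75):
    """Apply both thresholds stated in Text S7."""
    return (ppm_difference(mz_a, mz_b) <= max_ppm
            and cosine_similarity(spec_a, spec_b) > min_cosine)


if __name__ == "__main__":
    # Toy spectra: identical fragmentation pattern, precursor masses ~1.9 ppm apart.
    print(same_ion(518.3240, [10.0, 50.0, 5.0], 518.3250, [10.0, 50.0, 5.0]))
```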

Metabologenomic correlations Strains with metabolomics data were referenced against the BiG-SCAPE GCF absence/presence matrices. GCFs that had representative gene clusters in two or more strains were considered correlatable and entered into the correlations dataset. The different BiG-SCAPE modes and cutoffs produced variable numbers of correlatable GCFs and thus different numbers of ion-GCF hypotheses (Table S7).

Table S7. Size of each correlation round resulting from various BiG-SCAPE outputs.

| Correlation round | Correlatable GCFs | Ion-GCF hypotheses |
|---|---|---|
| glocal 0.30 | 4,474 | 26,056,576 |
| glocal 0.50 | 5,067 | 29,510,208 |
| global 0.30 | 4,237 | 24,676,288 |
| global 0.50 | 5,176 | 30,145,024 |
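The hypothesis counts in Table S7 equal the number of correlatable GCFs multiplied by the 5,824 ions detected in two or more strains (Text S7), i.e. every ion is paired with every correlatable GCF. The check below reproduces that arithmetic; the variable names are illustrative.

```python
IONS = 5824  # ions detected in two or more strains (Text S7)

correlatable_gcfs = {
    "glocal 0.30": 4474,
    "glocal 0.50": 5067,
    "global 0.30": 4237,
    "global 0.50": 5176,
}

# One ion-GCF hypothesis per (ion, correlatable GCF) pair.
hypotheses = {name: n_gcfs * IONS for name, n_gcfs in correlatable_gcfs.items()}

if __name__ == "__main__":
    for name, count in hypotheses.items():
        print(f"{name}: {count:,}")
```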

Fig. S14. Binary correlation score chart between ions and GCFs used for metabologenomics.

Decoy databases for each round of correlation were made by scrambling the Boolean arrays of ion and GCF detection patterns across strains to make randomized datasets of the same size. Ion-GCF pairs in correlation rounds and decoy datasets were assigned correlation scores by applying a binary scoring method as previously described⁴ (Fig. S14). A 1% FDR cutoff was calculated for each correlation round output via comparison to its decoy database score distribution (data not shown). The nine ion-GCF pairs in a manually curated “golden dataset” were located within each round and their correlation scores were compared to the calculated FDR for each round (Table S8). Overall, glocal mode GCF annotation outperformed the global mode outputs at the analyzed cutoff levels, with 5/9 golden dataset correlations outperforming a 1% FDR in the glocal 0.50 correlation round.
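The decoy construction and FDR cutoff can be illustrated with the sketch below. Two points are assumptions rather than details given in the text: the scoring function itself is not reproduced (scores are taken as given), and the FDR at a threshold is estimated as the number of decoy scores at or above it divided by the number of real scores at or above it.

```python
import random


def scramble(pattern, rng):
    """Return a shuffled copy of one Boolean presence/absence array across strains."""
    shuffled = list(pattern)
    rng.shuffle(shuffled)
    return shuffled


def fdr_threshold(real_scores, decoy_scores, fdr=0.01):
    """Lowest score t with (#decoy >= t) / (#real >= t) <= fdr, or None if there is none."""
    for t in sorted(set(real_scores)):
        n_real = sum(s >= t for s in real_scores)  # at least 1, since t is a real score
        n_decoy = sum(s >= t for s in decoy_scores)
        if n_decoy / n_real <= fdr:
            return t
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    ion_pattern = [True, False, True, True, False, False]
    print(scramble(ion_pattern, rng))
    print(fdr_threshold([1, 2, 3, 10, 12], [1, 2, 3, 3, 4]))
```

Scrambling keeps the number of strains in which each ion or GCF is detected, so the decoy dataset has the same size and sparsity as the real one.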

Table S8. Correlation scores for known ion-GCF pairs in the dataset across the four correlation rounds. The correlation score is highlighted in yellow if it was higher than the estimated 1% FDR threshold.


Text S8. Metabolic labeling of detoxins N1, N2, P1, P2, and P3 with stable isotope-labeled amino acids. Streptomyces spectabilis Dietz NRRL 2792 (ATCC 27741) was obtained from the American Type Culture Collection (ATCC) and was grown on 60 mm solid agar media Petri plates containing arginine/glycerol/salts medium (1 L of DI water, 15 g agar, 1 g arginine, 12.5 g glycerol, 1 g potassium phosphate dibasic, 1 g sodium chloride, 0.5 g magnesium sulfate heptahydrate, 10 mg iron(II) sulfate hexahydrate, 1 mg copper(II) sulfate pentahydrate, 1 mg manganese(II) sulfate monohydrate, and 1 mg zinc sulfate heptahydrate). Amycolatopsis jejuensis NRRL B-24427 was obtained from the Agricultural Research Service of the United States Department of Agriculture and was grown on solid agar media Petri plates containing mannitol/soyflour medium (1 L of DI water, 15 g agar, 20 g D-mannitol, 20 g soy flour). For all metabolic labeling experiments, media was supplemented with 1 mL of a 10 mM solution of each stable isotope-labeled amino acid. The stable isotope-labeled amino acids used were 13C6-isoleucine, d8,15N-phenylalanine, d7-proline, 2,5,5-d3-proline, phenyl-d4-tyrosine, d8-valine, 2-d1-valine, and 3-d1-valine. After five days of incubation in the presence of stable isotope-labeled amino acids, plates were frozen overnight at -20 °C, thawed, and pressed to release spent liquid media. Extracellular secondary metabolites were extracted using 30 mg Supel-Select HLB SPE cartridges (Supelco) and eluted with 90% acetonitrile. Samples were dried, resuspended in 5% acetonitrile, and analyzed by reversed-phase LC-MS/MS on a Q Exactive mass spectrometer as described above. The methods used for LC-MS data acquisition on the Q Exactive were the same except for occasional parameter adjustments made to target major unnatural isotope ions for optimal fragmentation.

Metabolomics data and structural analysis

Structural characterization of detoxin S1 from the P450/enoyl clade. The ‘P450/enoyl clade’ is characterized by the presence of two P450 genes and an enoyl-CoA hydratase/isomerase within its BGCs. Analysis of MS data from extracts of Streptomyces sp. NRRL S-325, which contains a BGC within this clade, revealed the fatty acid amide detoxin S1 (1) with an [M+H]+ of m/z 518.324. Analysis of tandem MS fragmentation suggested that this compound featured a heptanamide side chain, the largest such alkyl group on a detoxin or rimosamide described to date. Detailed analysis of tandem MS fragmentation data for 1 and its comparison to tandem MS data for the known detoxin B3 assisted in structural assignment, both providing strong evidence for the alkyl side chain (Figs. S15 and S16). Specifically, the neutral loss from the iminium fragment ion at m/z 232.170 to that of m/z 120.081 in the detoxin S1 data and a difference in m/z 42.048 between corresponding tripeptide fragments and molecular ions in the detoxin S1 to detoxin B3 comparison strongly supported the unique incorporation of this aliphatic moiety (Fig. S16).
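The mass differences quoted above can be compared with monoisotopic formula masses. The C3H6 increment is named in the caption of Fig. S16; assigning the neutral loss of 232.170 − 120.081 to C7H12O (a heptanoyl-derived loss) is an assumption made here for illustration and is not stated in the source. Element masses are standard monoisotopic values.

```python
# Monoisotopic masses (u) of the elements used below.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def formula_mass(counts):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())


c3h6 = formula_mass({"C": 3, "H": 6})
c7h12o = formula_mass({"C": 7, "H": 12, "O": 1})
observed_loss = 232.170 - 120.081  # neutral loss between the two fragment ions

if __name__ == "__main__":
    print(f"C3H6   = {c3h6:.4f} (reported difference: 42.048)")
    print(f"C7H12O = {c7h12o:.4f} (observed loss: {observed_loss:.3f})")
```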

Fig. S15. Tandem MS spectrum of detoxin S1 (1, m/z 518.324) from Streptomyces species NRRL S-325.

Tandem MS fragmentation supports the assignment of valine, phenylalanine, heptanoyl, and modified proline residues in the proposed structure of detoxin S1.

Fig. S16. Comparison of the tandem MS spectrum of detoxin S1 (1, m/z 518.324) from Streptomyces species NRRL S-325 with the tandem MS spectrum of detoxin B3.

Matching fragments between detoxin S1 and detoxin B3 in the low m/z range support assignment of valine, phenylalanine, and the modified proline common among this class. An increase in m/z 42.048 from fragments of detoxin B3 to those of detoxin S1 in the higher m/z range supports an increase in C3H6 for the detoxin S1 side chain, suggesting a hydrocarbon chain.

Structural characterization of detoxins N1 and N2 from the supercluster clade. The ability for Streptomyces spectabilis NRRL 2792 to produce detoxins was predicted solely based on the phylogenetic grouping of its tauD query gene adjacent detoxin/spectinomycin BGC superclusters but in the absence of full genomic data or a complete detoxin BGC. Tandem mass-spectrometry analysis of a solid agar media growth of S. spectabilis NRRL 2792 indicated production of six detoxin-like natural products, including [M+H]+ ions of m/z 464.240 and m/z 522.247, both with the incorporation of tyrosine unique among the detoxins and rimosamides, named detoxins N1 (2; Fig. S17-18) and N2 (3; Fig. S19-20). Evidence for a related detoxin with a putative propionylation at the modified proline was also observed with an [M+H]+ ion of m/z 577.286, though full characterization was not possible due to the ion’s low abundance. LC-MS analysis of cultures supplemented with stable isotope–labeled amino acids confirmed structural predictions based on analysis of the BGC and tandem MS data: 2 and 3 are biosynthesized from condensations of isoleucine and tyrosine to a central, modified proline (Figs. S17- S20), with 3 including an acetoxy modification of proline, likely at position 3 based on comparison to other detoxins. This was supported by the incorporation of six deuterons when labeling with d7-proline, and complete retention of all deuterium labeling when feeding 2,5,5-d3-proline (Fig. S20). The formamido group is established in both analogs based on analysis of spectral data and its comparison to known analogs, with the iminium ion at m/z 164.071 giving the strongest support for a formyl group forming an amide bond with the phenylalanine substituent.


Fig. S17. Tandem MS spectrum of detoxin N1 (2, m/z 464.239) from Streptomyces spectabilis NRRL 2792.

Tandem MS fragmentation supports the assignment of isoleucine, tyrosine, formyl, and modified proline residues in the detoxin N1 chemical structure.

Fig. S18. Stable isotope labeled-amino acid incorporation of d7-proline, phenyl-d4-tyrosine, and 13C6-isoleucine in detoxin N1 (2, m/z 464.239) from Streptomyces spectabilis NRRL 2792.

All ions indicative of labeled amino acid incorporation in MS1 spectra were confirmed by analysis of fragmentation patterns in the corresponding tandem MS data.

Fig. S19. Tandem MS spectrum of detoxin N2 (3, m/z 522.244) from Streptomyces spectabilis NRRL 2792.

Tandem MS fragmentation supports the assignment of isoleucine, tyrosine, formyl, and acetoxy-modified proline residues in the detoxin N2 chemical structure.

Fig. S20. Stable isotope labeled-amino acid incorporation of d7-proline, phenyl-d4-tyrosine, and 13C6-isoleucine in detoxin N2 (3, m/z 522.244) from Streptomyces spectabilis NRRL 2792.

Combined with molecular formula analysis and comparison to known detoxins and rimosamides, the loss of one deuteron from d7-proline and the retention of all deuterons in 2,5,5-d3-proline in metabolic labeling experiments support assignment of the acetoxy group at the 3 position on proline. All ions indicative of labeled amino acid incorporation in MS1 spectra were confirmed by analysis of fragmentation patterns in the corresponding tandem MS data.

Structural characterization of detoxins P1, P2, and P3 from the Amycolatopsis/P450 clade. Tandem MS analysis of solid agar Amycolatopsis jejuensis NRRL B-24427 cultures indicated a pair of + hydroxylated detoxin-like isomers with [M+H] of m/z 506.287, named detoxins P1 (4; Fig. S21) and P2 + (5; Fig. S26), as well as detoxin P3, a closely related analog free of hydroxylation with [M+H] of m/z 490.291 (6; Fig. S31). The latter of the three was not observed in the molecular network but was instead identified through analysis of raw spectral data. As before, validation of amino acid assignments observed in MS2 fragmentation data was achieved through several metabolic feeding experiments using stable isotope-labeled amino acids (Figs. S22, S27, and S32). Tandem MS analyses of 4, 5, and 6 revealed incorporations of proline, while deuterium-labeled valine was incorporated at two positions in each variant: modified to an isovaleryl residue in all analogs, incorporated directly as valine in compounds 4 and 6 and modified to hydroxyvaline in compound 5 (Figs. S23, S33, and S28, respectively). In all incorporations within the Amycolatopsis detoxins, labeled valine was observed with the loss of a deuteron, regardless of the presence of hydroxylation in the final structure. We hypothesize that the P450 enzyme encoded in the Amycolatopsis jejuensis NRRL B-24427 BGC (see Fig. 5) oxidizes all incorporated valines, with a subsequent reduction of all but the hydroxylvaline residue observed in compound 5. The P450 hydroxylation was determined to occur in all three analogs at the terminal methyl group of valine whether the hydroxy group was retained in the final structure or not. This conclusion is based on MS analysis of Amycolatopsis jejuensis NRRL B-24427 culture extracts following feeding of 2d1- and 3d1-valine (Figs. S24-S25, S29-S30). 
Compound 4 is predicted to feature a tyrosine based on analysis of its tandem MS fragmentation (the observation of the m/z 136.076 tyrosine-derived imine fragment ion being key to this assignment), while incorporation of phenylalanine was observed in compounds 5 and 6 when cultures were fed d8,15N-phenylalanine. Incorporation of neither labeled, hydroxylated phenylalanine nor labeled tyrosine was observed in compound 4, likely owing to the compound's low overall production levels.
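
The m/z values quoted above can be checked with simple monoisotopic arithmetic. The following is a minimal sketch, not part of the original analysis: the elemental compositions of the immonium ions (C8H10NO+ for tyrosine, C8H10N+ for phenylalanine) and the isotope masses are standard reference values assumed here, and the helper name `cation_mz` is hypothetical.

```python
# Monoisotopic masses (u) of the most abundant isotopes.
# Standard reference values; they are an assumption, not taken from the source.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
ELECTRON = 0.00054858  # electron mass (u)


def cation_mz(formula, charge=1):
    """Return m/z of a positive ion.

    `formula` maps element symbols to atom counts and must already include
    any added protons (e.g. C8H10NO for the tyrosine immonium ion).
    """
    mass = sum(MONOISOTOPIC[element] * count for element, count in formula.items())
    return (mass - charge * ELECTRON) / charge


# Immonium ("imine") fragment ions commonly used to diagnose aromatic residues.
tyr_immonium = cation_mz({"C": 8, "H": 10, "N": 1, "O": 1})
phe_immonium = cation_mz({"C": 8, "H": 10, "N": 1})

# Difference between the reported [M+H]+ values of detoxins P1/P2 and P3.
delta = 506.287 - 490.291

print(f"Tyr immonium m/z: {tyr_immonium:.4f}")
print(f"Phe immonium m/z: {phe_immonium:.4f}")
print(f"P1/P2 minus P3:   {delta:.3f} (one oxygen = {MONOISOTOPIC['O']:.4f})")
```

Under these assumptions the tyrosine immonium ion falls within about 1 mDa of the reported m/z 136.076, and the 15.996 difference between m/z 506.287 and m/z 490.291 is consistent with a single additional oxygen atom, as expected for a hydroxylation.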

Fig. S21. Tandem MS spectrum of detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

Tandem MS fragmentation supports the assignment of valine, tyrosine, isovaleryl, and modified proline residues in the detoxin P1 chemical structure. The m/z 136.076 and m/z 305.150 ions are key to assigning the hydroxylation to the aryl ring rather than to the isovaleryl residue.

Fig. S22. Stable isotope-labeled amino acid incorporation of d7-proline and d8-valine with loss of one deuteron in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

All ions indicative of labeled amino acid incorporation in MS1 spectra were confirmed by analysis of fragmentation patterns in the corresponding tandem MS data. Though phenyl-d4-tyrosine incorporation was not observed, incorporation of tyrosine is proposed based on molecular formula and tandem MS analysis as well as comparison to known detoxins and rimosamides. The loss of one deuteron from d8-valine in the metabolic labeling experiment is likely due to oxidation and reduction reactions carried out by a P450 enzyme in the Amycolatopsis jejuensis NRRL B-24427 detoxin BGC. One of the oxidations is retained in the valine of detoxin P2, whereas all are fully reduced in detoxins P1 and P3.
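
The deuteron loss discussed here corresponds to a predictable m/z shift in the labeled species. The sketch below is illustrative only: the isotope masses are standard reference values, the function names `label_shift` and `labeled_mz` are hypothetical, and the number of retained deuterons is treated as an input. The phenylalanine case assumes full retention of all eight deuterons purely for illustration, which the source does not state.

```python
# Monoisotopic masses (u); standard reference values, not taken from the source.
H1, H2 = 1.00782503, 2.01410178      # protium, deuterium
N14, N15 = 14.00307401, 15.00010890  # nitrogen-14, nitrogen-15


def label_shift(n_deuterium, n_15n=0):
    """Mass added (u) when n_deuterium H are replaced by D and n_15n 14N by 15N."""
    return n_deuterium * (H2 - H1) + n_15n * (N15 - N14)


def labeled_mz(unlabeled_mz, n_deuterium, n_15n=0, charge=1):
    """Predicted m/z of the labeled ion, given the unlabeled m/z and charge."""
    return unlabeled_mz + label_shift(n_deuterium, n_15n) / charge


# d8-valine incorporated with loss of one deuteron: 7 D retained per
# valine-derived unit (valine, hydroxyvaline, or isovaleryl).
one_valine = labeled_mz(506.286, n_deuterium=7)
two_valines = labeled_mz(506.286, n_deuterium=14)

# d8,15N-phenylalanine, ASSUMING all eight deuterons and the 15N are retained
# (an illustrative assumption; retention is not specified in the text).
phe_full = labeled_mz(506.286, n_deuterium=8, n_15n=1)

print(f"one labeled valine unit:   {one_valine:.3f}")
print(f"two labeled valine units:  {two_valines:.3f}")
print(f"d8,15N-Phe (full, assumed): {phe_full:.3f}")
```

Each retained deuteron adds about 1.0063 u, so a d8-valine unit that has lost one deuteron shifts a singly charged ion by roughly 7.044 rather than 8.050. Comparing the observed shift against these two values is one way to distinguish retention from loss.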

Fig. S23. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of d8-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of d8-valine-labeled detoxin P1. b, Predicted fragmentation indicates labeled and unlabeled valine and isovaleryl residue incorporation with loss of one deuteron, likely due to oxidation and reduction by a P450 enzyme.

Fig. S24. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of 2d1-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of 2d1-valine-labeled detoxin P1. b, Fragmentation with retention of deuterium at the 2 position of valine supports direct incorporation of valine and incorporation of valine as an isovaleryl residue, both without oxidation at position 2.

Fig. S25. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of 3d1-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of 3d1-valine-labeled detoxin P1. b, Fragmentation with retention of deuterium at the 3 position of valine supports direct incorporation of valine and incorporation of valine as an isovaleryl residue, both without oxidation at position 3.

Fig. S26. Tandem MS spectrum of detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

Tandem MS fragmentation supports the assignment of hydroxyvaline, phenylalanine, isovaleryl, and modified proline residues in the detoxin P2 chemical structure.

Fig. S27. Stable isotope-labeled amino acid incorporation of d7-proline, d8,15N-phenylalanine, and d8-valine with loss of one deuteron in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

All ions indicative of labeled amino acid incorporation in MS1 spectra were confirmed by analysis of fragmentation patterns in the corresponding tandem MS data. The loss of one deuteron from d8-valine in the metabolic labeling experiment is due to hydroxylation of the valine methyl group and a putative oxidation and subsequent reduction of the methyl group in the isovaleryl residue. These transformations are further elucidated through other metabolic labeling experiments presented here.

Fig. S28. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of d8-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of d8-valine-labeled detoxin P2. b, Predicted fragmentation indicates labeled and unlabeled hydroxyvaline and isovaleryl residue incorporation with loss of one deuteron, likely due to oxidation and reduction by a P450 enzyme.

Fig. S29. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of 2d1-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of 2d1-valine-labeled detoxin P2. b, Fragmentation with retention of deuterium at the 2 position of valine supports incorporation of valine as 4-hydroxyvaline and isovaleryl residues, both without oxidation at position 2.

Fig. S30. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of 3d1-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of 3d1-valine-labeled detoxin P2. b, Fragmentation with retention of deuterium at the 3 position of valine supports incorporation of valine as 4-hydroxyvaline and isovaleryl residues, both without oxidation at position 3.

Fig. S31. Tandem MS spectrum of detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.

Tandem MS fragmentation supports the assignment of valine, phenylalanine, isovaleryl, and modified proline residues in the detoxin P3 chemical structure.

Fig. S32. Stable isotope-labeled amino acid incorporation of d7-proline, d8-phenylalanine, and d8-valine with loss of one deuteron in detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.

All ions indicative of labeled amino acid incorporation in MS1 spectra were confirmed by analysis of fragmentation patterns in the corresponding tandem MS data. The loss of one deuteron from d8-valine in the metabolic labeling experiment is likely due to oxidation and reduction reactions carried out by a P450 enzyme in the Amycolatopsis jejuensis NRRL B-24427 detoxin BGC. One of the oxidations is retained in the valine of detoxin P2, whereas all are fully reduced in detoxins P1 and P3.

Fig. S33. Tandem MS fragmentation data for stable isotope-labeled amino acid incorporation of d8-valine in detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.

a, MS2 spectrum of d8-valine-labeled detoxin P3. b, Predicted fragmentation indicates labeled and unlabeled valine and isovaleryl residue incorporation with loss of one deuteron, likely due to oxidation and reduction by a P450 enzyme.

REFERENCES

Blondel, V. D., Guillaume, J. L., & Lambiotte, R. (2008). Fast unfolding of communities in large networks. Journal of Statistical. Retrieved from http://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008/meta

Clauset, A., Newman, M. E. J., & Moore, C. (2004). Finding community structure in very large networks. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics, 70(6 Pt 2), 066111. https://doi.org/10.1103/PhysRevE.70.066111

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 1–9. Retrieved from http://www.necsi.edu/events/iccs6/papers/c1602a3c126ba822d0bc4293371c.pdf

Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972–976. https://doi.org/10.1126/science.1136800

Langfelder, P., & Horvath, S. (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559. https://doi.org/10.1186/1471-2105-9-559

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research: JMLR, 12(Oct), 2825–2830. Retrieved from http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf

Pons, P., & Latapy, M. (2005). Computing Communities in Large Networks Using Random Walks. In Computer and Information Sciences - ISCIS 2005 (pp. 284–293). Springer Berlin Heidelberg. https://doi.org/10.1007/11569596_31

Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics, 76(3 Pt 2), 036106. https://doi.org/10.1103/PhysRevE.76.036106

Rosvall, M., Axelsson, D., & Bergstrom, C. T. (2009). The map equation. The European Physical Journal. Special Topics, 178(1), 13–23. https://doi.org/10.1140/epjst/e2010-01179-1

Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4), 1118–1123. https://doi.org/10.1073/pnas.0706851105

R Core Team. (2013). R: A language and environment for statistical computing. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.470.5851&rep=rep1&type=pdf

Van Dongen, S. M. (2000). Graph clustering by flow simulation. Retrieved from https://dspace.library.uu.nl/handle/1874/848

Wickham, H., Chang, W., & Others. (2008). ggplot2: An implementation of the Grammar of Graphics. R package version 0.7, URL: http://CRAN.R-project.org/package=ggplot2. Retrieved from http://ftp.auckland.ac.nz/software/CRAN/src/contrib/Descriptions/ggplot.html

Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4, Article 17. https://doi.org/10.2202/1544-6115.1128

Argimón S, Abudahab K, Goater RJE, Fedosejev A, Bhai J, Glasner C, et al. Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microb Genomics [Internet]. 2016 Nov 30 [cited 2018 Aug 14];2(11). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5320705/