BioRxiv 2018, X

Supplementary Information for:
A computational framework for systematic exploration of biosynthetic diversity from large-scale genomic data

Jorge C. Navarro-Muñoz1,2*, Nelly Selem-Mojica3*, Michael W. Mullowney4*, Satria Kautsar1, James H. Tryon4, Elizabeth Parkinson5$, Emmanuel L.C. De Los Santos6, Marley Yeong1, Pablo Cruz-Morales3, Sahar Abubucker7, Arne Roeters1, Wouter Lokhorst1, Antonio Fernandez-Guerra8, Luciana Teresa Dias Cappelini4, Regan Thomson4, William W. Metcalf5, Neil L. Kelleher4, Francisco Barona-Gomez3#, Marnix H. Medema1#

1Bioinformatics Group, Wageningen University, The Netherlands.
2Westerdijk Fungal Biodiversity Institute, The Netherlands.
3Evolution of Metabolic Diversity Laboratory, Unidad de Genómica Avanzada (Langebio), Cinvestav-IPN, Irapuato, Mexico.
4Department of Chemistry, Northwestern University, Evanston, Illinois, United States.
5Carl R. Woese Institute for Genomic Biology and Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States.
6Warwick Integrative Synthetic Biology Centre, University of Warwick, Coventry, United Kingdom.
7Novartis Institutes for BioMedical Research, Cambridge, United States.
8Microbial Genomics and Bioinformatics, Max Planck Institute for Marine Microbiology, Bremen, Germany.

*joint first authors
#joint corresponding authors
E-mail: [email protected], [email protected]
$Current address: Department of Chemistry, Purdue University, West Lafayette, Indiana, United States.

BiG-SCAPE parameters setup
  Text S1. Index Information
  Table S1. Manually Curated Compound Groups
  Text S2. BiG-SCAPE classes weight optimization
  Figure S1. Scatterplots with weight optimization results.
  Text S3. MIBiG network cutoff analysis
  Figure S2. Targeted attack of the MIBiG network
  Text S4. Clustering analysis on the MIBiG network
  Figure S3. Clustering analysis on the MIBiG network using glocal mode
  Figure S4. Clustering analysis on the MIBiG network using global mode
  Table S2. Clustering methods used to identify potential BGC families in the training networks.
  Figure S5. Alluvial plot depicting BiG-SCAPE's assignment of MIBiG BGCs to Gene Cluster Families against the manually curated groups in Table S1.
  Text S5. Actinobacteria Genomes Data Set
Other BiG-SCAPE information
  Table S3. Compute times for BiG-SCAPE calculations for datasets of multiple sizes.
  Table S4. Default Anchor Domain List used in the DSS index.
  Table S5. Logic for internal class separation.
  Figure S6. Flowchart of BiG-SCAPE components.
  Figure S9. Verified clusters encoding rimosamide biosynthesis. Signature gene(s) highlighted.
  Figure S10. Rimosamide-related GCFs in Fig 3a.
  Figure S11. Phylogenomic reconstruction of 103 complete Streptomyces genomes with outgroups Catenulispora acidiphila CP001700.1 and Salinispora arenicola CP000850.1.
  Figure S12. BiG-SCAPE / CORASON GCFs of the closed Streptomyces genomes.
  Figure S13. tauD Actinobacteria EvoMining expansions tree.
  Table S6. Occurrences of tauD homologues in secondary metabolism BGCs as reported in MIBiG.
Microbiology and metabolomics methods
  Text S6. Cultivation of actinomycetes for MS-based metabolomics.
  Text S7. Acquisition and analysis of LC-MS metabolomics data.
Metabologenomic correlations
  Table S7. Size of each correlation round resulting from various BiG-SCAPE outputs.
  Fig. S14. Binary correlation score chart between ions and GCFs used for metabologenomics.
  Table S8. Correlation scores for known ion-GCF pairs in the dataset across the four correlation rounds. The correlation score is highlighted in yellow if it was higher than the estimated 1% FDR threshold.
  Text S8. Metabolic labeling of detoxins N1, N2, P1, P2, and P3 with stable isotope-labeled amino acids.
Metabolomics data and structural analysis
  Structural characterization of detoxin S1 from the P450/enoyl clade.
    Fig. S15. Tandem MS spectrum of detoxin S1 (1, m/z 518.324) from Streptomyces species NRRL S-325.
    Fig. S16. Comparison of the tandem MS spectrum of detoxin S1 (1, m/z 518.324) from Streptomyces species NRRL S-325 with the tandem MS spectrum of detoxin B3.
  Structural characterization of detoxins N1 and N2 from the supercluster clade.
    Fig. S17. Tandem MS spectrum of detoxin N1 (2, m/z 464.239) from Streptomyces spectabilis NRRL 2792.
    Fig. S18. Stable isotope labeled-amino acid incorporation of d7-proline, phenyl-d4-tyrosine, and 13C6-isoleucine in detoxin N1 (2, m/z 464.239) from Streptomyces spectabilis NRRL 2792.
    Fig. S19. Tandem MS spectrum of detoxin N2 (3, m/z 522.244) from Streptomyces spectabilis NRRL 2792.
    Fig. S20. Stable isotope labeled-amino acid incorporation of d7-proline, phenyl-d4-tyrosine, and 13C6-isoleucine in detoxin N2 (3, m/z 522.244) from Streptomyces spectabilis NRRL 2792.
  Structural characterization of detoxins P1, P2, and P3 from the Amycolatopsis/P450 clade.
    Fig. S21. Tandem MS spectrum of detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S22. Stable isotope labeled-amino acid incorporation of d7-proline and d8-valine with loss of one deuteron in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S23. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of d8-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S24. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 2d1-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S25.
    Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 3d1-valine in detoxin P1 (4, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S26. Tandem MS spectrum of detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S27. Stable isotope labeled-amino acid incorporation of d7-proline, d8,15N-phenylalanine, and d8-valine with loss of one deuteron in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S28. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of d8-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S29. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 2d1-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S30. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of 3d1-valine in detoxin P2 (5, m/z 506.286) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S31. Tandem MS spectrum of detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S32. Stable isotope labeled-amino acid incorporation of d7-proline, d8-phenylalanine, and d8-valine with loss of one deuteron in detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.
    Fig. S33. Tandem MS fragmentation data for stable isotope labeled-amino acid incorporation of d8-valine in detoxin P3 (6, m/z 490.291) from Amycolatopsis jejuensis NRRL B-24427.
REFERENCES

BiG-SCAPE parameters setup

Text S1. Index Information

The distance between two Biosynthetic Gene Clusters (BGCs) A and B is calculated by combining three similarity scores:

Jaccard Index: the number of distinct domain types shared by both BGCs divided by the total number of distinct domain types in the pair:

\[ JI_{AB} = \frac{|D^{*}_{A} \cap D^{*}_{B}|}{|D^{*}_{A} \cup D^{*}_{B}|} \]

where $D_{A(B)}$ is the list of all domains predicted in BGC A (B) and $D^{*}_{A(B)}$ is the set of distinct domain types in BGC A (B).

Adjacency Index: the number of distinct adjacent domain pairs shared by both BGCs divided by the total number of distinct adjacent pairs of domain types:

\[ AI_{AB} = \frac{|P^{*}_{A} \cap P^{*}_{B}|}{|P^{*}_{A} \cup P^{*}_{B}|} \]

where $P^{*}_{A(B)}$ is the set of distinct adjacent domain pair types in BGC A (B), e.g.:

\[ P^{*}_{A} = \{ (D_A[i],\, D_A[i+1]) \mid 0 \le i < L_A - 1 \} \]

where $L_A$ is the length of the list of domains predicted in BGC A. For gene clusters in multi-record files (e.g. with information from different loci, which only occurs for MIBiG reference gene clusters), the pairs are formed in the order in which the domains appear in the original file.

Domain Sequence Similarity: a score that considers the sequence similarities for every domain type.

\[ DSS = 1 - DSS_{d} \]

where $DSS_{d}$ is the Domain Sequence Dissimilarity. This is divided further into two subcomponents:

\[ DSS_{d} = W_{a}\, DSS_{a} + W_{na}\, DSS_{na} \]

The first accounts for so-called anchor domains, a list of domains that can be given a special weight (for the default list of anchor domains $a$, which contains well-known domains from e.g. NRPS or PKS BGCs, see Table S4), while the second accounts for the remaining domains, $na$. Each subcomponent (anchor or non-anchor) is calculated in the same manner:

\[ DSS_{a} = \frac{1}{S_{a}} \sum_{d \in a} sd(d_A, d_B), \qquad DSS_{na} = \frac{1}{S_{na}} \sum_{d \in na} sd(d_A, d_B) \]

where

\[ S_{a} = \sum_{d \in a} \max\left(n^{d}_{A},\, n^{d}_{B}\right) \]

Here $d$ runs over all distinct domain types in the pair that belong to the list of anchor domains, and $n^{d}_{A(B)}$ is the number of copies of domain $d$ in BGC A (B); $S_{na}$ is defined analogously over the non-anchor domain types. $sd$ is a function that takes all copies of domain $d$ in A and B, pairs the best-matching copies of that domain type using the Hungarian algorithm, and returns the sum over those pairs of the complement of the sequence similarity ($1 - \text{similarity}$, with the similarity calculated from domain sequences aligned against their hmm profile using hmmalign). Each extra copy without a match, and each copy of an unshared domain, contributes 1 (complete dissimilarity).
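The index definitions above can be illustrated with a minimal Python sketch. It is not BiG-SCAPE's own code: the function names, the example domain labels, and the `similarity` lookup are hypothetical, adjacent pairs are assumed to be ordered tuples, and the best matching of domain copies is found by brute force over permutations as a simple stand-in for the Hungarian algorithm (same optimum, only practical for small copy numbers).

```python
from itertools import permutations


def jaccard_index(domains_a, domains_b):
    """Shared distinct domain types divided by all distinct domain types."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def adjacent_pairs(domains):
    """Distinct adjacent domain pairs, taken in file order (assumed ordered)."""
    return {(domains[i], domains[i + 1]) for i in range(len(domains) - 1)}


def adjacency_index(domains_a, domains_b):
    """Shared distinct adjacent pairs divided by all distinct adjacent pairs."""
    pairs_a, pairs_b = adjacent_pairs(domains_a), adjacent_pairs(domains_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def sd(copies_a, copies_b, similarity):
    """Summed dissimilarity for one domain type.

    Matched copies contribute 1 - similarity; every unmatched copy adds 1.
    Brute-force search replaces the Hungarian algorithm in this sketch.
    """
    short, long_ = sorted((copies_a, copies_b), key=len)
    unmatched = len(long_) - len(short)
    best = min(
        sum(1.0 - similarity(x, y) for x, y in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best + unmatched


def dss_component(copies_in_a, copies_in_b, similarity):
    """One dissimilarity subcomponent (anchor or non-anchor domain types).

    copies_in_a / copies_in_b map a domain type to its list of copies.
    """
    types = set(copies_in_a) | set(copies_in_b)
    s = sum(
        max(len(copies_in_a.get(d, [])), len(copies_in_b.get(d, [])))
        for d in types
    )
    total = sum(
        sd(copies_in_a.get(d, []), copies_in_b.get(d, []), similarity)
        for d in types
    )
    return total / s if s else 0.0


# Hypothetical example data: domain labels and pairwise similarities.
EXAMPLE_A = ["KS", "AT", "ACP", "TE"]
EXAMPLE_B = ["KS", "AT", "KR", "ACP"]
EXAMPLE_SIM = {frozenset(("a1", "b1")): 0.9, frozenset(("a2", "b1")): 0.5}


def example_similarity(x, y):
    """Symmetric lookup of a precomputed similarity in [0, 1]."""
    return EXAMPLE_SIM[frozenset((x, y))]


print(jaccard_index(EXAMPLE_A, EXAMPLE_B))
print(adjacency_index(EXAMPLE_A, EXAMPLE_B))
print(dss_component({"KS": ["a1", "a2"], "AT": ["a3"]},
                    {"KS": ["b1"]}, example_similarity))
```

In the example, the two domain lists share three of five distinct domain types and one of five distinct adjacent pairs. In the last call, the unmatched second KS copy and the unshared AT domain each add a full unit of dissimilarity before normalisation by the number of domain copies considered.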
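Finally, if the pair contains domains of each kind (anchor and non-anchor), both subcomponents are first weighted in proportion to the total number of domains of each kind (including copies), $S_{a}$ and $S_{na}$:

\[ w_{a} = \frac{S_{a}}{S_{a} + S_{na}}, \qquad w_{na} = \frac{S_{na}}{S_{a} + S_{na}} \]

and then re-weighted to increase the perceived amount of anchor domains through the "anchorboost" parameter, which gives the final weights $W_{a}$ and $W_{na}$ of the anchor and non-anchor subcomponents:

\[ W_{a} = \frac{w_{a} \times \mathit{anchorboost}}{w_{a} \times \mathit{anchorboost} + w_{na}}, \qquad W_{na} = \frac{w_{na}}{w_{a} \times \mathit{anchorboost} + w_{na}} \]

Table S1.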