bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Global analysis of adenylate-forming enzymes reveals β-lactone biosynthesis pathway in pathogenic Nocardia

Serina L. Robinson,1,2,3* Barbara R. Terlouw,4 Megan D. Smith,1,3 Sacha J. Pidot,5 Tim P. Stinear,5 Marnix H. Medema,4 and Lawrence P. Wackett1,2,3

1BioTechnology Institute, University of Minnesota, 1479 Gortner Avenue, Saint Paul, MN, 55108, USA

2Graduate Program in Bioinformatics and Computational Biology, University of Minnesota 111 S. Broadway, Suite 300, Rochester, MN, 55904, USA

3Graduate Program in Microbiology, Immunology, and Cancer Biology, University of Minnesota, 689 23rd Ave SE, Minneapolis, MN, 55455, USA

4Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands

5Department of Microbiology and Immunology at the Doherty Institute, University of Melbourne, 792 Elizabeth St, Melbourne, VIC, 3000 Australia

*Corresponding author: Serina Robinson E-mail: [email protected]

Running title: Global analysis of adenylate-forming enzymes

Keywords: natural product biosynthesis, acetyl-CoA synthetase, enzyme catalysis, bioinformatics, machine learning, , adenylate-forming enzymes, Nocardia, β-lactone synthetases, substrate specificity

ABSTRACT based predictive tool and used it to comprehensively map the biochemical diversity Enzymes that cleave ATP to activate carboxylic of adenylate-forming enzymes across >50,000 acids play essential roles in primary and secondary candidate biosynthetic gene clusters in bacterial, metabolism in all domains of life. Class I adenylate- fungal, and plant genomes. Ancestral enzyme forming enzymes share a conserved structural fold reconstruction and sequence similarity but act on a wide range of substrates to catalyze networking revealed a ‘hub’ topology suggesting reactions involved in bioluminescence, radial divergence of the adenylate-forming nonribosomal peptide biosynthesis, fatty acid superfamily from a core enzyme scaffold most activation, and β-lactone formation. Despite their related to contemporary aryl-CoA ligases. Our metabolic importance, the substrates and catalytic classifier also predicted β-lactone synthetases in functions of the vast majority of adenylate-forming novel biosynthetic gene clusters conserved across enzymes are unknown without tools available to >90 different strains of Nocardia. To test our accurately predict them. Given the crucial roles of computational predictions, we purified a adenylate-forming enzymes in biosynthesis, this candidate β-lactone synthetase from Nocardia also severely limits our ability to predict natural brasiliensis and reconstituted the biosynthetic product structures from biosynthetic gene clusters. pathway in vitro to link the gene cluster to the β- Here we used machine learning to predict lactone natural product, nocardiolactone. We adenylate-forming enzyme function and substrate anticipate our machine learning approach will aid specificity from protein sequence. We built a web- in functional classification of enzymes and advance natural product discovery.

1 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

INTRODUCTION longer supported or available, showed that this class of enzymes is amenable to computational Adenylation is a widespread and essential reaction predictions of substrate and function (10). in nature to transform inert carboxylic acid groups Previously, bioinformatics tools were also into high energy acyl-AMP intermediates. Class I developed to predict substrates for NRPS adenylate-forming enzymes catalyze reactions for adenylation (A) domains (11, 12). Genome natural product biosynthesis, firefly mining approaches using NRPS A domain bioluminescence, and the activation of fatty acids prediction tools have proved useful to access the with coenzyme A (1). The conversion of acetate to biosynthetic potential of unculturable organisms acetyl-CoA by a partially purified acetyl-CoA and link ‘orphan’ natural products with their ligase was first described by Lipmann in 1944 (2). biosynthetic gene clusters. For example, NRPS A Since then, enzymes with this conserved structural domain predictions guided the discovery of the fold have been found to activate over 200 different biosynthetic machinery for the leinamycin family substrates including aromatic, aliphatic, and amino of natural products (13) and enabled sequence- acids. To encompass the major functional enzyme based structure prediction of several novel classes, the term ‘ANL’ superfamily was proposed lipopeptides by Zhao et al. (14). However, Zhao based on the Acyl-CoA ligases, Nonribosomal et al. reported limitations in existing tools in that peptide synthetases (NRPS), and Luciferases (3). they could not predict the chain length of lipid tails incorporated into lipopeptides. Lipid tails are Most ANL superfamily enzymes catalyze two-step prevalent in natural products and are incorporated reactions: adenylation followed by by fatty acyl-AMP or acyl-CoA ligase enzymes, thioesterification. During the thioesterification both in the ANL superfamily. These enzymes are step, ANL enzymes undergo a dramatic among the most well-studied subclasses of ANL conformational change involving a 140° domain enzymes. For less-studied subclasses, enzyme rotation of the C-terminal domain (3). functions and substrates are even more bond formation results from nucleophilic attack, challenging to predict. Hence, a computational typically by a phosphopantetheine thiol group. A tool encompassing all classes of ANL enzymes notable exception to phosphopantetheine is the use would constitute a major step towards more of molecular oxygen by firefly luciferase to convert accurate structural prediction of natural product D-luciferin to a light-emitting oxidized intermediate scaffolds. (3). Other interesting exceptions include functionally-divergent ANL enzymes within the Here, we used machine learning to develop a same pathway that catalyze the adenylation and predictive platform for ANL superfamily thioesterification partial reactions separately (4, 5). enzymes and map their substrate-and-function The ANL superfamily has recently expanded to landscape across 50,064 candidate biosynthetic include several new classes of enzymes including gene clusters in bacterial, fungal, and plant the fatty-acyl AMP ligases (6), aryl polyene genomes. We detected candidate β-lactone adenylation enzymes (7), and β-lactone synthetases synthetases in uncharacterized biosynthetic gene (8). Strikingly, an ANL enzyme in cremeomycin clusters from pathogenic Nocardia spp. and biosynthesis was shown to use nitrite to catalyze experimentally validated the gene cluster in vitro late stage N-N bond formation in diazo-containing to link it to the orphan β-lactone compound, natural products (9). The discovery of such novel nocardiolactone. Overall, this research provides a enzymatic reactions nearly 75 years after the proof-of-principle towards the use of machine Lipmann’s initial report suggests the ANL learning for classification of enzyme substrates to superfamily still has unexplored catalytic potential, guide natural product discovery. particularly in specialized metabolism.

No computational tools currently exist for prediction and functional classification of ANL enzymes at the superfamily level. Still, the development of one previous platform, which is no

2 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

RESULTS Previously, 34 amino acids within 8 Å of the Machine learning can accurately predict ANL gramicidin synthetase active site were shown enzyme function and substrate specificity to be critical for accurate prediction of NRPS A domain substrate specificity (11). Due to the high A global analysis of protein family domains level of structural conservation between NRPS A revealed that ANL superfamily enzymes (PF00501; domains and other proteins in the superfamily, we AMP-binding domains) are the third most abundant hypothesized these 34 active site residues would domain in known natural product biosynthetic also be important features for ANL substrate pathways (Table S1). Despite the essential and prediction. Using an AMP-binding profile varied roles of ANL enzymes, there is no database Hidden Markov Model (pHMM), we extracted 34 that catalogs their biosynthetic diversity. Therefore, active site residues from our >1,700 training set we mined the literature, MIBiG (15), and sequences. Further inspection of the superfamily- UniProtKB (16) for ANL enzymes with known wide pHMM alignment revealed the presence of substrate specificities. We then constructed a a fatty acyl-AMP ligase-specific insertion (FSI) training set of >1,700 ANL protein sequences of 20 amino acids not present in other paired with their functional class, substrate(s), superfamily members (Fig. S1). The FSI has been kinetic data, and crystal structures if solved. As suggested to inhibit the 140° domain rotation of reported previously by Gulick and others, ANL the C-terminal domain and is critical for the superfamily enzymes in our training set were rejection of CoA-SH as an acceptor molecule (6). divergent at the sequence level but shared a Since the FSI was shown experimentally to be an common structural fold and core motifs including important feature to distinguish FAAL from (Y/F)(G/W)X(A/T)E and (S/T)GD critical for LACS enzymes, we extracted 20 FSI residues ATP-binding and catalysis (3). from each sequence and appended them to our feature vector for a total of 54 residues. Each We defined nine major functional enzyme classes amino acid was further encoded as 15 normalized on the basis of having enough experimentally- real numbers corresponding to different characterized enzymes per class to enable physicochemical properties including classification by machine learning: short chain acyl hydrophobicity, volume, secondary structure, and CoA synthetases (C2 – C5, SACS), medium chain electronic properties (11). The physicochemical acyl-CoA synthetases (C6 – C12, MACS), long properties were then used to train machine chain acyl-CoA synthetases (C13 – C17, LACS), learning models to predict enzyme function and very-long chain acyl-CoA synthetases (C18+, substrate (Fig. 1A). VLACS), fatty acyl-AMP ligases (FAAL), luciferases (LUC), β-lactone synthetases (BLS), Three different machine learning algorithms were aryl-CoA ligases (ARYL), and NRPS A domains evaluated for our classification problem: (NRPS). In addition, we trained a separate model to feedforward neural networks, naïve Bayes, and predict enzyme substrate specificity. While random forest. Random forest performed slightly prediction at the level of an individual substrate is better than its counterparts for both function and desirable, the broad substrate specificity of some substrate classification problems (Fig. 1B and classes of ANL enzymes limited prediction to Fig. S2). Naïve Bayes also performed well, but groups of chemically-similar compounds. For was significantly slower than random forest in example, one class of LACS has demonstrated run-time. The feedforward neural network activity with fatty acids with chain lengths ranging performed the worst, likely due to a relatively from C8 – C20 (17). There were over 200 small number of training samples, and was also chemically-distinct substrates in our training set, the slowest model to train. On the basis of speed many with just one experimental example. and accuracy we chose to proceed with the Therefore, we clustered substrates based on random forest algorithm. Our best performing chemical similarity (Tanimoto coefficient) to model achieved 82.5 ± 2.3% test set classification identify 15 groups for broad level substrate accuracy for substrate specificity and 83.3 ± 3.0% classification (Table S2). for functional class prediction (Fig. 1B and Fig.

3 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

S2). The average area under the receiver operating two evolutionary scenarios: 1) aryl-CoA ligase characteristic curve (micro-average AUROC) was activity arose independently several times 0.981 for substrate group prediction and 0.978 for throughout ANL superfamily evolution or 2) functional class (Fig. 1C and Fig. S2). Within-class radial divergence of the superfamily occurred accuracy was highest for the FAAL and NRPS A from an ancestral scaffold similar to domains, most likely due to a larger amount of contemporary ARYL enzymes (18). To experimental data for these protein families (Fig. investigate these scenarios further, we used a S2). Both of our classifiers also performed well maximum-likelihood approach for ancestral with more specialized enzymes that accept single sequence reconstruction to estimate the 10 most substrates such as the BLS and LUC classes (Fig. likely sequences for the predicted ancestral S2). We speculate that enzymes with specialized protein at the root of the ANL phylogeny (19). functions might have distinct ‘active site patterns’ We used AdenylPred to extract sequence features that could be learned by our algorithms due to the and predict function and substrate for our preference of these enzymes for single substrates reconstructed ancestral proteins. AdenylPred (e.g. D-luciferin). Our machine learning model classified 10/10 of the most likely ancestral ANL consistently performed the worst on substrate proteins as aryl-CoA ligases most likely to classification for enzymes with broad substrate activate aryl and biaryl derivatives as substrates specificity such as the aryl/biaryl acids and C6 – C11 (probability score = 0.6). We also tested a chain length fatty acids and aryl acids (Fig. 1D). maximum-likelihood ancestral reconstruction using only 34 active site residues as the seed We developed a web application, AdenylPred sequences rather than full sequences and obtained (z.umn.edu/AdenylPred), for Adenylation similar ARYL predictions (probability score = Prediction to make our machine learning models 0.7). These findings suggest that the active sites publicly available. Users can upload their of our reconstructed ancestral ANL proteins were sequences of adenylate-forming enzymes in multi- most similar to contemporary ARYL enzymes. FASTA or GenBank format either as nucleotide or protein sequences. Predictions and probability Sequence similarity networking reveals a scores are reported for both functional classification ‘hub’ network topology and substrate specificity. The entire ANL enzyme training set is also available in a searchable To construct a map the distribution of adenylate- database format. Overall, the web app provides an forming enzymes in biosynthetic gene clusters, interactive interface for users with little we applied AdenylPred to a taxonomically- computational experience. A command-line diverse and representative collection of bacterial, version of the tool is also available for download fungal, and plant genomes. We extracted 50,064 for the analysis of large datasets. standalone adenylate-forming enzyme sequences from candidate biosynthetic gene clusters Specialized ANL enzymes may have evolved detected using antiSMASH (20), fungiSMASH from an ancestral scaffold utilizing CoA-SH (21), and plantiSMASH (22). To visualize results, we constructed a sequence similarity We next used a phylogenetic approach to network with adenylate-forming enzymes investigate the functional divergence of ANL colored by their functional classification using superfamily enzymes (Fig. 2). Maximum- AdenylPred (Fig. 3A). The sequence similarity likelihood phylogenetic analysis revealed that some network displayed a ‘hub-and-spoke’ topology in functional classes, including the BLS, LUC, and which most functional protein subfamilies NRPS enzymes, formed tight monophyletic clades showed higher sequence similarity with hub whereas others, mainly enzymes in the ARYL sequences than with any other subfamily (Fig. class, were dispersed throughout the tree. The wide 3A). The hub nodes were mainly comprised of phylogenetic distribution of the ARYL sequences aryl-CoA ligases. To test the robustness of this indicated many ARYL enzymes were more closely topology with a different sequence set, we related to different protein subfamilies than to each constructed a sequence similarity network from a other. This observation could likely be explained by smaller set of all AMP-binding domains from the

4 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

manually-curated MIBiG database (Fig. S3). to the di-alkyl backbone (23). Notably, Again, we recovered the same topology with the NRPS and formyltransferase genes were absent aryl-CoA ligases at the center. Of the 48,250 full from all candidate clusters in Nocardia (Fig. 4B). length AMP-binding hits that we analyzed with We hypothesized that this minimal set of AdenylPred, the majority were predicted to be in biosynthetic genes, termed nltABCD, encode the the ARYL and NRPS classes (Fig. 3B). There were necessary enzymes for Nocardia spp. to produce no predicted luciferases in the biosynthetic gene a di-alkyl β-lactone product similar to lipstatin clusters which was expected since insect genomes but lacking an amino acid side chain. Predicted including fireflies were not included in the dataset. functions of genes in the nltABCD operon Interestingly, the predicted FAALs outnumbered correspond to reactions required to produce the LACS more than tenfold (3,673 to 322). Since both orphan structure of the β-lactone product, FAALs and LACSs both accept long-chain fatty nocardiolactone (Fig. 4A), prompting us to acids as substrates, these findings support previous characterize the biosynthetic enzymes and reports that the majority of lipid tails in lipopeptides pathway experimentally. and other natural products may be incorporated through FAAL-mediated activity rather than CoA- In vitro reconstitution of the nocardiolactone activation (23). β-Lactone formation and CoA- pathway links biosynthetic gene cluster to its activation of long chain fatty acids and were the orphan natural product least common functions likely catalyzed by ANL enzymes in biosynthetic gene clusters (Fig. 3B). Nocardiolactone was originally isolated in 1999 from a pathogenic strain of Nocardia brasiliensis AdenylPred-guided discovery of β-lactone and other unidentified strains of Nocardia spp., synthetases in biosynthetic gene clusters none of which are available in public culture collections (26). Genetic manipulation in β-Lactone synthetases were the most recently Nocardia is challenging therefore we opted discovered members of the ANL superfamily and instead to reconstitute the complete biosynthetic comprehensive analysis of their prevalence in pathway in vitro by heterologously expressing natural product biosynthetic gene clusters has never and purifying individual nltABCD pathway been conducted. We examined AdenylPred hits for enzymes. This approach gave us full control to β-lactone synthetases from our collection of 50,064 determine the function of each pathway enzyme candidate biosynthetic gene clusters. As expected, through biochemical analysis of intermediates we detected candidate β-lactone synthetases in gene and comparison to synthetic standards. clusters responsible for the biosynthesis of known β-lactone natural products including ebelactone and We cloned and expressed a candidate β-lactone lipstatin (24, 25). We also unexpectedly detected synthetase, termed NltC, from a publicly uncharacterized candidate β-lactone synthetases in available N. brasiliensis genome. NltC purified as >90 genomes belonging to the bacterial genus a 60 kDa monomer. To test NltC for β-lactone Nocardia. Only one β-lactone natural product, synthetase activity, we incubated NltC, ATP, and nocardiolactone, from Nocardia spp., has been MgCl2 with syn- and anti-2-octyl-3- isolated to date (26). No follow-up studies on hydroxydodecanoic acid diastereomers nocardiolactone were published and the synthesized as substrate analogs (Fig. 5A). We biosynthetic gene cluster was never reported observed NltC-catalyzed formation of cis-and resulting in nocardiolactone being termed an trans-β-lactones after overnight reaction ‘orphan’ natural product. compared to no enzyme controls (Fig. S4A). The coupling constants were consistent with synthetic We identified three flanking genes around the standards for cis- and trans-β-lactones of predicted β-lactone synthetase genes in Nocardia comparable chain length. AMP release is also spp. that shared synteny with biosynthetic genes for commonly used as a readout for ANL the β-lactone natural product, lipstatin (Fig. 4A). superfamily activity. In a time-course analysis of The lipstatin cluster encodes NRPS and AMP release by NltC, we observed activity with formyltransferase enzymes that attach an N-formyl C20 chain length β-hydroxy acids but no activity

5 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

with C14 length analogs above the level of the no- positioned glutamate from the β-chain was shown substrate control (Fig. 5A). NltC also showed weak to enter the active site of the α-chain in the OleA activity with 2-hexyldecanoic acid as a substrate homodimer and deprotonate the α-carbon to mimic, but not with 10-nonadecanol, suggesting activate it (29). Based on homology modeling and adenylation of the carboxylic acid occurs rather structural alignments with OleA, NltBE57 may be than hydroxyl group (Fig. S4B). These results are similarly poised to act as a nucleophile in the consistent with the reported adenylation activity of NltA active site (Fig. S5A). We used site-directed for the β-lactone synthetase from Xanthomonas mutagenesis to mutate the NltB glutamate (E57) campestris and with the ANL superfamily (27). to either alanine or glutamine. Claisen Overall, these results support AdenylPred-guided condensation activity was abolished in NltABE57A predictions that NltC in N. brasiliensis is a and NltABE57Q mutants compared to wild-type functional β-lactone synthetase. NltAB (Fig. S5B). Taken together, these results suggest NltA and NltB may form a functional To further characterize enzymes involved in heterodimer with NltBE57 required to catalyze the nocardiolactone biosynthesis, we heterologously Claisen condensation of two fatty acyl-CoA expressed and purified two upstream thiolases, substrates. Crystallographic studies are required NltA and NltB. Recently, homologous enzymes in to investigate LstAB and NltAB further. the lipstatin biosynthetic pathway, LstA and LstB were shown to form a functional heterodimer to The final enzyme in the nocardiolactone cluster, catalyze ‘head-to-head’ Claisen condensation of NltD, belongs to the short chain reductase two acyl-CoAs (28). We hypothesized NltA and superfamily. NltD purified as a 78 kDa fusion NltB might catalyze a similar reaction to form the protein with a maltose binding protein tag. NltD nocardiolactone backbone (Fig. 5B). When has an N-terminal conserved nucleotide binding heterologously expressed, NltA was mostly motif (Rossman fold) and a SXnYXXXK insoluble and formed inclusion bodies even when catalytic triad characteristic of short chain co-expressed with chaperones. We also attempted reductase superfamily members (30). NltD shares to purify a NltA homolog from a closely-related 53% amino acid identity with the lipstatin organism, Nocardia yamanashiensis, and again reductase (LstD) and 36% identity with X. observed inclusion body formation. In contrast, campestris OleD, a 2-alkyl-3-ketoalkanoic acid NltB could be purified with a moderate yield (~28 reductase involved in olefin biosynthesis (Fig. mg/L of culture). When tested individually, NltA 4A). Since studies on OleD demonstrated 2-alkyl- and NltB protein preparations did not display 3-ketoalkanoic acids are unstable, we monitored activity with any chain length (C8 – C16) acyl-CoA the reaction in reverse with 2-alkyl-3- substrate tested. However, when NltA and NltB hydroxyalkanoic acid substrates using a were co-expressed on a single plasmid we obtained spectrophotometric assay for NADPH formation soluble protein that actively catalyzed the Claisen (30). Purified NltD catalyzed NADP+-dependent condensation of long chain acyl-CoAs to β- conversion of 2-alkyl-3-hydroxyalkanoic acid to ketoacids (Fig. 5B). Size exclusion chromatography 2-alkyl-3-ketoalkanoic acid with both C20 and C14 indicated that co-expressed NltAB eluted at di-alkyl β-hydroxy acid substrates at a rate approximately 72 kDa, suggesting a heterodimeric similar to X. campestris OleD (Fig. S6). state. We next reconstituted the entire pathway to From homology modeling and sequence analysis produce nocardiolactone-like analogs by we observed that the NltB sequence was truncated combining purified pathway enzymes with and lacked a Cys-His-Asn catalytic triad similar to decanoyl-CoA precursors, NADPH, ATP, and reports for LstB (28). Although no crystal structure MgCl2 (Fig. 5C and Fig. S7). Due to instability of LstAB is available, site-directed mutagenesis and loss of activity of purified NltAB over time, revealed a conserved glutamate in LstB (E60) was we substituted the functionally-equivalent and required for condensation activity (28). In the stable homodimer, OleA, from X. campestris crystal structure of the physiological homodimer, (31). After overnight incubation, we observed OleA, from Xanthomonas campestris, a similarly-

6 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

formation of a di-alkyl β-lactone natural product DISCUSSION (Fig. S7). Based on in vitro evidence, we propose nocardiolactone biosynthesis is initiated via ‘head- Based on our ancestral reconstruction, we to-head’ Claisen condensation of fatty acyl-CoA propose ancient ANL enzymes had an active site substrates catalyzed by a heterodimeric interaction most similar to contemporary enzymes that between NltA and NltB. NltD then reduces the di- activate aryl and biaryl acids with CoA. This alkyl β-keto acid to a di-alkyl β-hydroxy acid in an hypothesis is supported by the unusual hub NADPH-dependent manner. Finally, topology of the ANL network. Many other intramolecular ring closure of a β-hydroxy acid to a enzyme superfamilies do not have this topology β-lactone is catalyzed by the ATP-dependent β- and instead tend to show patterns of sequential lactone synthetase NltC (Fig. 5C). Overall, these functional divergence (33, 34). However, Babbitt results link the nltABCD biosynthetic gene cluster and colleagues detected a similar hub topology in in Nocardia brasiliensis to the orphan natural the nitroreductase superfamily and provided product, nocardiolactone. multiple lines of evidence supporting that the topology indicated radial divergence of the The nocardiolactone gene cluster is enriched in superfamily from a minimal flavin-binding human pathogens scaffold (18). Our results also suggest ANL sequences may have undergone divergent With the biosynthetic gene cluster established, we evolution towards more specialized functions next probed the taxonomic distribution and from ancestral enzymes with CoA-ligase-like abundance of the nocardiolactone pathway in scaffolds. Nocardia genomes. We used AdenylPred to identify putative β-lactone synthetases and detected The proposed evolutionary trajectory of the ANL nltC homologs with flanking nltABD genes in 94 superfamily is supported by experimental out of 159 complete Nocardia genomes in the evidence for CoA-ligase activity in many extant PATRIC database (32). Notably, the strict nltABCD ANL enzymes that primarily perform other cluster was not detected in any closely related functions. For example, Linne et al. tested the genera such as Rhodococcus, Streptomyces, or CoA-ligase activity of five different NRPS A Mycobacterium, suggesting nocardiolactone domains (35). Surprisingly, all the NRPS A biosynthesis may be specific to the genus Nocardia. domains were also able to synthesize acyl-CoAs We observed the nocardiolactone gene cluster was in vitro. Enzymatic CoA-ligase activity of the more prevalent among strains of pathogenic clinical NRPS A domains varied proportionally to their isolates from human patients than in strains isolated evolutionary similarity with a native acyl-CoA from other sources (Fig. 4). The complete nltABCD synthetase, suggesting that greater sequence cluster was detected in 68% of genomes of distinct divergence resulted in more specialized NRPS A species of human pathogenic Nocardia relative to domain activity and reduced bifunctionality (35). 27% isolated from non-human sources (p-value: Firefly luciferases were also demonstrated to be 0.002, Fisher’s exact test). We also queried a bifunctional as CoA-ligases (36) and luciferase separate private database comprised of 169 clinical activity was conferred to an acyl-CoA ligase from isolates of Nocardia from human patients (Pidot, a non-luminescent click beetle by just a single Stinear, in prep) and recovered a similar proportion point mutation (37). Arora et al. showed FAAL of complete nocardiolactone cluster hits among enzymes likely lost their CoA-ligase activity due clinical isolates (112/169, ~66%). Although only to the FSI. Indeed, deletion of the FSI conferred correlative, the enrichment warrants further acyl-CoA ligase activity in FAAL28 from research and suggests a potential link between the Mycobacterium tuberculosis (6). The fact that nocardiolactone cluster and Nocardia simple mutations can revert many enzymes back pathogenicity. to a CoA-ligase state supports the hypothesis that the ANL superfamily arose from an ancestral enzyme using CoA-SH as an acceptor molecule. However, we cannot rule out the alternative hypothesis of CoA-ligase activity arising

7 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

independently several times in the ANL as intermediates in the biosynthesis of superfamily. Overall, ancestral reconstruction hydrocarbons and other secondary metabolites. coupled with AdenylPred analysis yielded new insights into the evolutionary structure-function Among the predicted gene clusters with β-lactone relationships among adenylate-forming enzymes. synthetase homologs, we detected a conserved cluster in Nocardia which we linked to the orphan Within the ANL superfamily, the β-lactone β-lactone natural product, nocardiolactone. synthetases were most recently discovered and the Numerous biosynthetic gene clusters have been extent of their role in natural product biosynthesis detected in Nocardia spp., and many genomes remained poorly understood (8, 27). We used from Nocardia have as many type I polyketide AdenylPred to detect >90 β-lactone synthetases in synthases and NRPS gene clusters as uncharacterized biosynthetic gene clusters which is Streptomyces genomes (41). Despite this, only a significantly more than the eight known small number of biosynthetic gene clusters in biosynthetic gene clusters for β-lactone natural Nocardia have been experimentally-verified. products reported to date (38). The disparity Recently, the biosynthetic gene cluster for the between the number of predicted β-lactone nargenicin family of macrolide antibiotics was biosynthetic gene clusters and known β-lactone discovered in human pathogenic strains of natural products has several possible explanations. Nocardia arthriditis (42). Khosla and colleagues One reason may be limited discovery due to the also reported on a unique class of orphan reactivity and thermal instability of β-lactones. β- polyketide synthases in the genomes of Nocardia lactones are strained rings that can rapidly isolates from human patients with nocardiosis hydrolyze in aqueous solutions (38) or thermally (43). Based on the conservation of this gene decarboxylate, thus hampering their detection by cluster in clinical isolates from nocardiosis common analytical methods such as GC-MS (8). It patients, Khosla proposed the product might play is also plausible many biosynthetic gene clusters a role in Nocardia pathogenicity in human hosts. with β-lactone synthetases are not expressed under Similarly, we found the nocardiolactone gene normal laboratory conditions. Another explanation cluster was significantly more abundant in the is the undetected role of β-lactones as intermediates genomes of human pathogenic strains compared in the biosynthesis of other chemical moieties (8). to isolates from non-human sources. The Chemists have long referred to β-lactones as enrichment supports the hypothesis that ‘privileged structures’ for the total synthesis of nocardiolactone could play a role in the compounds with a variety of functional groups pathogenicity of Nocardia however further in including β-lactams, gamma-lactones, and alkenes vivo studies are required. (38, 39). Our findings suggest microbes might also use β-lactones as intermediates since we detected a Nocardiolactone was first isolated from the number of β-lactone synthetase hits in gene clusters mycelia of Nocardia spp. rather than the known to make natural products without final β- fermentation broth, suggesting the compound is lactone moieties such as polyunsaturated fatty cell-associated (26). The long, waxy di-alkyl tails acids. Indeed, β-lactone synthetases in oleABCD of nocardiolactone resemble mycolic acids and gene clusters were recently linked to production of would likely embed in the cell membrane. The the final alkene moiety in the biosynthesis of a C31 cell wall composition varies between different polyunsaturated hydrocarbon product (40). Such species of Nocardia but is known to consist cases, if more widespread, may have also escaped primarily of trehalose dimycolate and several detection because most of the predicted gene other unidentified hydrophobic compounds (44). clusters with β-lactone synthetases were not Studies on Nocardia virulence found that cell- detected by tools like antiSMASH (20) until surface composition was a critical determinant recently, prohibiting the discovery of such for attachment and penetration of host cells (45). biosynthetic pathways through genome mining Cell wall-associated lipids in N. brasiliensis were approaches. The recent feature implementation in also shown to induce a strong inflammatory antiSMASH to detect likely β-lactone moieties now response (44). The role of hydrophobic natural enables further research into the role of β-lactones products in Nocardia pathogenicity and infection

8 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

remains a rich and untapped direction for future AdenylPred training set sequences to observe research. their relation to sequences with known specificity. A sequence similarity network was In summary, we conducted the first global analysis constructed by calculating pairwise BLAST of adenylate-forming enzymes in >50,000 similarities between all sequences. The network candidate biosynthetic gene clusters from all was examined over a comprehensive range of e- domains of life. Our machine learning approach value cutoffs and visualized using the ‘igraph’ yielded evolutionary insights into ANL superfamily package (48). divergence from a core scaffold similar to contemporary aryl-CoA ligases towards enzymes Training set construction The AdenylPred with more specialized functions such as β-lactone training set was pulled and manually-curated formation. AdenylPred analysis also detected >90 from three databases: UniProtKB, MIBiG, and β-lactone synthetases in gene clusters in Nocardia the most up-to-date NRPS A domain training set spp. that were enriched in the genomes of human from SANDPUMA (12). AMP-binding enzymes pathogens. Through in vitro pathway (PF00501) in the UniProtKB database that had reconstitution, we were able to link this gene cluster experimental evidence at the protein level were family to the orphan natural product extracted and linked to their substrate through nocardiolactone. These findings demonstrate how literature mining and manual verification. The machine learning methods can be used to pair gene MIBiG database was queried with the AMP- clusters with orphan secondary metabolites and binding HMM (PF00501). MIBiG sequences advance understanding of natural product were extracted and linked to their product and biosynthesis. substrate when reported in the literature. The SANDPUMA NRPS A domain dataset (12) was EXPERIMENTAL PROCEDURES randomly down-sampled with stratification by substrate class to balance the training set classes Sequence similarity network Candidate for substrate and enzyme function prediction. biosynthetic gene clusters from >24,000 bacterial Training set sequences were grouped into 15 genomes in the antiSMASH database version 2 substrate groups and 9 enzyme functional classes. (46) were combined with pre-calculated plantiSMASH output (22) and results from Machine learning methods Protein sequences in fungiSMASH analysis of 1,100 fungal genomes the training set were aligned with the AMP- (accession numbers available at binding HMM using HMMAlign (49). 34 https://github.com/serina-robinson/ substrate residues within 8 Å of the active site adenylpred_analysis/). The AMP-binding pHMM were extracted and encoded as tuples of (PF00501) was used to query all genes in the physicochemical properties as described by dataset with default parameters, returning 213,993 Röttig et al. (11). The FSI was extracted, one-hot significant hits. Since the distribution and identity encoded, and concatenated to the vector of active of NRPS adenylation domains have already been site vector properties for a total of 585 sequence analyzed in detail by Chevrette et al. (12), we opted features. Three different machine learning to analyze only standalone AMP-binding pHMM algorithms were trained for multiclass hits. All sequences with a condensation domain classification: random forest, naïve Bayes, and (PF00668) in the same coding sequence were feedforward neural networks. The data were split filtered out, leaving 71,331 ‘standalone’ AMP- with stratified sampling into 75% training and binding HMM hits. Of these, partial sequences (less 25% test sets. Tuning parameters for all models than 150 amino acids in length) were removed for a were adjusted by grid search using 10-fold cross total of 63,395 sequences. Due to computational validation repeated with 5 iterations. The limitations of visualizing large networks, the confidence of random forest predictions can be sequences were clustered using CD-HIT (47) with assessed using a nonparametric probability a word size of 2 and 40% sequence similarity cut- estimation for class membership calculated as a off to yield 2,344 cluster representatives. Cluster value between 0 and 1 (50). Based on the representatives were combined with the distribution of prediction probabilities, we set an

9 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

empirical threshold for prediction confidence of 0.6 coli and synthesized by Integrated DNA (60%), below which all substrates are listed as ‘no Technologies (IDT, USA). NltC was cloned into confident result’ although the best prediction is still a pET30b+ vector with NdeI and HindIII provided to the user. restriction sites with a C-terminal 6x-His tag. NltD was cloned into a modified pMAL-c5x AdenylPred availability The web application is vector with a tobacco etch virus (TEV) protease available at z.umn.edu/adenylpred (shortened url) cut site added at NdeI and HindIII restriction sites or https://srobinson.shinyapps.io/AdenylPred/. The and a N-terminal 6x-His tag and expressed as a command-line version of the tool is available at fusion with maltose binding protein. NltA from https://github.com/serina-robinson/adenylpred. N. yamanashiensis was cloned into a pET28b+ at NdeI and XhoI restriction sites. NltA and NltB Phylogenetic analysis and ancestral from N. brasiliensis were cloned both reconstruction Training set sequences were individually as in the case of N. yamanashiensis aligned were aligned using HMMAlign (49) and and combined with a ribosome binding site-like terminal ends of the alignment were trimmed. The sequence (tttgtttaactttaagaaggaga) inserted into a phylogeny of the entire training set was estimated single pET28b+ vector with a N-terminal 6x-His using RAxML (51) with the Jones-Taylor-Thornton tag. Accession numbers and codon-optimized matrix-based model of amino acid substitution and plasmid sequences are available in the Supporting a discrete gamma model with 20 rate categories. Information. Constructs were cloned by Gibson For ancestral sequence reconstruction, training set assembly into DH5α cells and verified by Sanger sequences were redundancy filtered to 40% amino sequencing. Sequence-verified plasmids were acid identity with a word size of 2 with CD-HIT transformed into BL21 (DE3) cells (NEB). (47). FastML was used to reconstruct the most Starter cultures (5 mL) were grown in Terrific likely ancestral sequences for internal nodes of the Broth overnight at 37° C with kanamycin tree (19). AdenylPred was then used to predict the selection. 1-Liter cultures with 75 µg/ml functional class and substrate of the root ancestral kanamycin were grown to an optical density of sequence in the ANL superfamily. 0.5 at 37° C, induced by addition of 1 mL isopropyl β-D-1-thiogalactopyranoside (IPTG, 1 HPLC analysis of NltC activity HPLC analysis of M stock) and further incubated for 19 h at 15 °C. ATP, ADP, and AMP was conducted using an Induced cells were harvested at 4000 x g with a Agilent 1100 series instrument with a C18 eclipse Beckman centrifuge and frozen at -80 °C. Cell plus (Agilent) column mounted with a C18 guard pellets were resuspended in 10 mL of buffer column. Reactions were carried out in glass HPLC containing 500 mM NaCl, 20 mM Tris base, and vials with a total volume of 500 µL in assay buffer 10% glycerol at pH 7.4. Elution buffer for the (50 mM Tris base adjusted to pH 8.0 with HCl). nickel column was the same but with the addition Reactions were initiated with the addition of 0.5 of 400 mM imidazole. NltD purification buffer µM enzyme (15 µg) to 100 µM of ATP, 100 µM required the addition of 0.025% Tween 20. Cell of substrate, and 2% EtOH originating from the pellets were thawed on ice, lysed with 2–3 cycles substrate stocks. Separation of ATP, ADP, and in a French pressure cell (1500 pounds per square AMP was observed after 9 min under isocratic inch) and centrifuged for 60 min at 17,000 x g. conditions with 95% 100 mM H2KPO4 (pH 6.0 with Supernatants were filtered through a 0.45 µM low KOH) and 5% methanol while monitoring at 259 protein binding filter (Corning, USA), loaded into nm. a GE Life Sciences ÄKTA fast liquid protein chromatography system and injected onto a GE Cloning, Expression, and Purification of Life Sciences HisTrap HP 5 ml column. After Nocardiolactone Biosynthetic Enzymes NltA washes to remove nonspecifically bound (WP_042260942.1), NltB (WP_042260944.1), proteins, His-tagged proteins were eluted with a NltC (WP_042260945.1) and NltD stepwise gradient of 5, 10, 15, and 80% elution (WP_042260949.1) from Nocardia brasiliensis and buffer over 2.5 column volumes at 1 ml/min and NltA from Nocardia yamanshiensis collected in 2 ml fractions. Protein concentration (WP_067710538.1), were codon-optimized for E. was determined by the method of Bradford (52)

10 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

using the Bio-Rad (USA) Protein Assay Dye and trans-3-octyl-4-nonyloxetan-2-one products Reagent Concentrate and a standard curve prepared were reported by Christenson et al. (8). from a 2 mg/mL bovine albumin standard (Thermo Scientific). The desired protein fractions were NltD activity assay Activity of NltD/OleD by pooled, analyzed by SDS-PAGE, flash frozen in NADPH-dependent consumption of 2-alkyl-3- liquid N2 and stored at -80°C. ketoalkanoic acid was monitored using the method described by Bonnett et al. (30). The GC-MS Separation and identification of reaction was measured in reverse since the β-keto metabolites was accomplished by GC-MS (Agilent acid substrate for NltD/OleD homologs was 7890a & 5975c) equipped with a 30 m x 0.25 mm previously shown to be unstable and undergoes i.d. x 0.25 µm DB-1ms capillary column with rapid decarboxylation to a (30). Briefly, outflow split to flame ionization and MS detectors. progress was tracked by the change in absorbance The formation of the unstable β-keto acid catalyzed at 340 nm by the formation or consumption of −1 −1 by OleA and NltAB could be observed as its ketone NADPH (ε340 = 6,220 M cm ) in UV- breakdown product by GC-MS as published transparent 96 well plates (Greiner, Sigma- previously (31). Olefins from the complete thermal Aldrich) measured using a SpectraMax Plus decarboxylation of β-lactone products were microplate reader (Molecular Devices). detected without derivatization by comparison to synthetic β-lactone standards described previously General synthetic procedures Synthesis and (8, 27). β-hydroxy acid required methylation of the purification of racemic 2-octyl-3- carboxylic acid group by diazomethane for hydroxydodecanoic acid was performed as detection by GC-MS. Ethereal alcoholic solutions described previously (8, 53) via α-carbon of diazomethane were prepared from N-methyl-N- deprotonation of decanoic acid by lithium nitroso-p-toluenesulfonamide (Sigma-Aldrich). All diisopropylamide (LDA) followed by the samples were extracted with tert-methyl butyl ether addition of decanal to form 2-octyl-3- and mixed with 50 µL of diazomethane solution. hydroxydodecanoate. An identical synthetic One microliter of each sample was injected into the method was used to synthesize a racemic injection port (230 °C). The 25 min program was as diastereomeric mix of 2-hexyl-3- follows: hold 80 °C for 2 min; ramp linearly to 320 hydroxyoctanoic acid from hexanal and octanoic °C for 20 min; hold 320 °C for 3 min. acid as described by Robinson et al. (27).

Verification of NltC β-lactone synthetase Data Availability activity by 1H-NMR Reaction mixtures were set Scripts and raw data for analyses and figures up in separatory funnels with 1 mg NltC, 20 mg presented in this manuscript are available on ATP, and 30 mg MgCl2 • 6H2O in 100 mL of 200 GitHub: https://github.com/serina- mM NaCl and 20 mM NaPO4 buffer (pH 7.4). robinson/adenylpred_analysis. A searchable Reactions were initiated with the addition of 1.5 mL database for the training set is also available at of 2-octyl-3-hydroxydodecanoic acids dissolved in z.umn.edu/adenylpred. EtOH (1.0 mg/mL stock). 10µL of 1-bromo- naphthalene (0.5 mg/mL stock) was added to each Author Contributions reaction mixture as an internal standard. Reactions Conceptualization, S.L.R., M.H.M., and L.P.W.; were allowed to run for 24 hours before 3 Methodology, S.L.R., B.R.T. M.H.M., and successive extractions were performed with 10 mL, L.P.W.; Software, S.L.R. and B.R.T.; Validation, 5 mL, and 5 mL of dichloromethane. Samples were S.L.R., B.R.T., S.J.P., and M.D.S.; Formal evaporated at room temperature before solvation in Analysis, S.L.R. and B.R.T.; Investigation, 1 CDCl3 for H-NMR (400 MHz). Figure S4A shows S.L.R., S.J.P. and M.D.S.; Data Curation, S.L.R. the 1H-NMR spectra of 2-octyl-3- and B.R.T.; Writing – Original Draft, S.L.R.; hydroxydodecanoic acids allowed to react Writing – Review & Editing, S.L.R., B.R.T., overnight with NltC compared to a no enzyme S.J.P, T.P.S., M.H.M., and L.P.W.; Supervision – control. Chemical shifts for synthetic 2-octyl-3- M.H.M. and L.P.W.

hydroxydodecanoic acid starting materials and cis-

11 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Acknowledgements (2010) Structure of the D- We acknowledge Satria Kautsar for assistance in alanylgriseoluteic acid biosynthetic extracting candidate biosynthetic gene clusters. protein EhpF, an atypical member of Aalt-Jan van Dijk is recognized for his insightful the ANL superfamily of adenylating discussions on machine learning. Kelly Aukema is enzymes. Acta Crystallogr. D. Biol. acknowledged for manuscript edits and discussion. Crystallogr. 66, 664-672. We are grateful to Jorge Navarro Muñoz for providing the output from fungiSMASH analysis. 6. Arora, P., Goyal, A., Natarajan, V.T., Rajakumara, E., Verma, P., Funding Gupta, R., Yousuf, M., Trivedi, S.L.R. is supported by the National Science O.A., Mohanty, D., Tyagi, A., and Foundation Graduate Research Fellowship under Sankaranarayanan, R. (2009) NSF grant number 00039202 and a Graduate Mechanistic and functional insights Research Opportunities Worldwide (GROW) into fatty acid activation in fellowship to the Netherlands supported by the NSF Mycobacterium tuberculosis. Nat. and the Netherlands Organization for Scientific Chem. Biol. 5, 166-173. Research (NWO) grant number 040.15.054/6097. 7. Cimermancic, P., Medema, M.H., Conflict of Interest Claesen, J., Kurita, K., Brown, L.C., M.H.M. is on the scientific advisory board of Mavrommatis, K., Pati, A., Godfrey, Hexagon Bio. P.A., Koehrsen, M., Clardy, J., Birren, B.W., Takano, E., Sali, A., References Linington, R.G., and Fischbach, 1. D’Ambrosio, H.K., and Derbyshire, M.A. (2014) Insights into Secondary E.R. (2020) Investigating the Role of Metabolism from a Global Analysis Class I Adenylate-Forming Enzymes of Prokaryotic Biosynthetic Gene in Natural Product Biosynthesis. ACS Clusters. Cell. 158, 412-421. Chem. Biol. 15, 17-27 8. Christenson, J.K., Richman, J.E., 2. Lipmann, F. (1944) Enzymatic Jensen, M.R., Neufeld, J.Y., synthesis of acetyl phosphate. J. Biol. Wilmot, C.M., and Wackett, L.P. Chem. 155, 55-70. (2017) β-Lactone synthetase found in the olefin biosynthesis pathway. 3. Gulick, A.M. (2009) Conformational Biochemistry. 56, 348-351. dynamics in the acyl-CoA synthetases, adenylation domains of non-ribosomal 9. Waldman, A.J., and Balskus, E.P. peptide synthetases, and firefly (2018) Discovery of a diazo- luciferase. ACS Chem. Biol. 4, 811- forming enzyme in cremeomycin 827. biosynthesis. J. Org. Chem. 83, 7539-7546. 4. Wang, N., Rudolf, J.D., Dong, L.B., Osipiuk, J., Hatzos-Skintges, C., 10. Khurana, P., Gokhale, R.S., and Endres, M., Chang, C.Y., Babnigg, G., Mohanty, D. (2010) Genome scale Joachimiak, A., Phillips, G.N., and prediction of substrate specificity Shen, B. (2018) Natural separation of for acyl adenylate superfamily of the acyl-CoA ligase reaction results in enzymes based on active site residue a non-adenylating enzyme. Nat. profiles. BMC Bioinformatics. 11, Chem. Biol. 14, 730-737. 57.

5. Bera, A.K., Atanasova, V., Gamage, 11. Röttig, M., Medema, M.H., Blin, K., S., Robinson, H., and Parsons, J.F. Weber, T., Rausch, C., and

12 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Kolbacher, O. (2011) expression of rat acyl-CoA NRPSpredictor2-a web server for synthetase 3. J. Biol. Chem. 271, predicting NRPS adenylation domain 16748-16752. specificity. Nucleic Acids Res. 39, W362-W367. 18. Akiva, E., Copp, J.N., Tokuriki, N., Babbitt, P.C. (2017) Evolutionary 12. Chevrette, M.G., Aicheler, F., and molecular foundations of Kohlbacher, O., Currie, C.R., and multiple contemporary functions of Medema, M.H. (2017) SANDPUMA: the nitroreductase superfamily. ensemble predictions of nonribosomal Proc. Natl. Acad. Sci. U.S.A. 114, peptide chemistry reveal biosynthetic E9549-E9558. diversity across Actinobacteria. Bioinformatics. 33, 3202-3210. 19. Ashkenazy, H., Penn, O., Doron- Faigenboim, A., Cohen, O., 13. Pan, G., Xu, Z., Guo, Z., Ma, M., Cannarozzi, G., Zomer, O. and Yang, D., Zhou, H., Gansemans, Y., Pupko, T. (2012) FastML: a web Zhu, X., Huang, Y., Zhao, L.X. and server for probabilistic Jiang, Y. (2017) Discovery of the reconstruction of ancestral leinamycin family of natural products sequences. Nucleic Acids Res. 40, by mining actinobacterial genomes. W580-584. Proc. Natl. Acad. Sci. U.S.A. 114, E11131-E11140. 20. Blin, K., Shaw, S., Steinke, K., Villebro, R., Ziemert, N., Lee, S.Y., 14. Zhao, H., Liu, Y.P., and Zhang, L.Q. Medema, M.H. and Weber, T. (2019) In silico and genetic analyses (2019) antiSMASH 5.0: updates to of cyclic lipopeptide synthetic gene the secondary metabolite genome clusters in Pseudomonas sp. 11K1. mining pipeline. Nucleic Acids Res. Front. Microbiol. 10, 544. 47, W81-W87.

15. Kautsar, S.A., Blin, K., Shaw, S., 21. Blin, K., Wolf, T., Chevrette, M.G., Navarro-Muñoz, J.C., Terlouw, B.R., Lu, X., Schwalen, C.J., Kautsar, van der Hooft, J.J., van Santen, J., S.A., Suarez Duran, H.G., de Los Tracanna, V., Suarez Duran, H.G., Santos, E.L., Kim, H.U., Nave, M. Andreu, V.P., Selem-Mojica, N., and Dickschat, J.S. (2017) Alanjary, M., Robinson, S.L., Lund, antiSMASH 4.0-improvements in G., Epstein, S.C., Sisto, A.C., chemistry prediction and gene Charkoudian, L.K., Collemare, J., cluster boundary identification. Linington, R., Weber, T., Medema, Nucleic Acids Res. 45, W36-W41. M.H. (2020) MIBiG 2.0: a repository for biosynthetic gene clusters of 22. Kautsar, S.A., Suarez Duran, H.G., known function. Nucleic Acids Res. Blin, K., Osbourn, A. and Medema, 48, D454-458. M.H. (2017) plantiSMASH: automated identification, annotation 16. UniProt Consortium. (2019) UniProt: and expression analysis of plant a worldwide hub of protein biosynthetic gene clusters. Nucleic knowledge. Nucleic Acids Res. 47, Acids Res. 45, W55-W63. D506-D515. 23. Galica, T., Hrouzek, P., and Mares, 17. Fujino, T., Kang, M.J., Suzuki, H., J. (2017) Genome mining reveals Iijima, H., and Yamamoto, T. (1996) high incidence of putative Molecular characterization and lipopeptide biosynthesis NRPS/PKS

13 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

clusters containing fatty acyl-AMP Wilmot, C.M. (2016) Substrate ligase genes in biofilm-forming trapping in crystals of the thiolase cyanobacteria. J. Phycol. 53, 985-998. OleA identifies three channels that enable long chain olefin 24. Bai, T., Zhang, D., Lin, S., Long, Q., biosynthesis. J. Biol. Chem. 291, Wang, Y., Ou, H., Kang, Q., Deng, Z., 26698-26706. Liu, W. and Tao, M. (2014) Operon for biosynthesis of lipstatin, the β- 30. Bonnett, S.A., Papireddy, K., lactone inhibitor of human pancreatic Higgins, S., del Cardayre, S. and lipase. Appl. Environ, Microbiol. 80, Reynolds, K.A. (2011) Functional 7473-83. characterization of an NADPH dependent 2-alkyl-3-ketoalkanoic 25. Wyatt, M.A., Ahilan, Y., acid reductase involved in olefin Argyropoulos, P., Boddy, C.N., biosynthesis in Stenotrophomonas Magarvey, N.A., and Harrison, P.H. maltophilia. Biochemistry. 50, 9633- (2013) Biosynthesis of ebelactone A: 40. isotopic tracer, advanced precursor and genetic studies reveal a 31. Frias, J.A., Richman, J.E., Erickson, thioesterase-independent cyclization J.S. and Wackett, L.P. (2011) to give a polyketide beta-lactone. J. Purification and characterization of Antibiotics. 66, 421-430. OleA from Xanthomonas campestris and demonstration of a non- 26. Mikami, Y., Yazawa, Y., Tanaka, Y., decarboxylative Claisen Ritzau, M., and Grafe, U. (1999) condensation reaction. J. Biol. Isolation and structure of Chem. 286, 10930-10938. nocardiolactone, a new dialkyl- substituted beta-lactone from 32. Wattam, A.R., Abraham, D., Dalay, pathogenic Nocardia strains. Nat. O., Disz, T.L., Driscoll, T., Prod. Lett. 13, 277-284. Gabbard, J.L., Gillespie, J.J., Gough, R., Hix, D., Kenyon, R. and 27. Robinson, S.L., Christenson, J.K., Machi, D. (2014) PATRIC, the Richman, J.E., Jenkins, D.J., Neres, J., bacterial bioinformatics database Fonseca, D.R., Aldrich, C.C. and and analysis resource. Nucleic Acids Wackett, L.P. (2019) Mechanism of a Res. 42, D581-D591 standalone β-lactone synthetase: New continuous assay for a widespread 33. Baier, F., and Tokuriki, N. (2014) ANL superfamily enzyme. Connectivity between catalytic ChemBioChem. 20, 1701-1711. landscapes of the metallo-β- lactamase superfamily. J. Mol. Biol. 28. Zhang, D., Zhang, F., and Liu, W. 426, 2442-56. (2019) A KAS-III heterodimer in lipstatin biosynthesis 34. Hicks, M.A., Barber, A.E., nondecarboxylatively condenses C8 Giddings, L.A., Caldwell, J., and C14 fatty acyl-CoA substrates by a O'Connor, S.E. and Babbitt, P.C. variable mechanism during the (2011) The evolution of function in establishment of a C22 aliphatic strictosidine synthase-like proteins. skeleton. J. Am. Chem. Soc. 141, Proteins. 79, 3082-98. 3993-4001. 35. Linne, U., Schafer, A., Stubbs, 29. Goblirsch, B.R., Jensen, M.R., M.T., and Marahiel, M.A. (2007) Mohamed, F.A., Wackett, L.P. and Aminoacyl-coenzyme A synthesis

14 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

catalyzed by adenylation domains. macrolides. Angew. Chem. Int. Ed. FEBS Lett. 581, 905-910. 58, 3996-4001.

36. Oba, Y., Ojika, M., and Inouye, S. 43. Kuo, J., Lynch, S.R., Liu, C.W., (2003) Firefly luciferase is a Xiao, X. and Khosla, C. (2016) bifunctional enzyme: ATP-dependent Partial in vitro reconstitution of an monooxygenase and a long chain fatty orphan polyketide synthase acyl-CoA synthetase. FEBS Lett. 540, associated with clinical cases of 251-254. nocardiosis. ACS Chem. Biol. 11, 2636-2641. 37. Oba, Y., Iida, K., and Inouye, S. (2009) Functional conversion of fatty 44. Trevino-Villarreal, J.H., Vera- acyl-CoA synthetase to firefly Cabrera L., Valero-Guillén, P.L., luciferase by site-directed and Salinas-Carmona, M.C. (2012) mutagenesis: A key substitution Nocardia brasiliensis cell wall responsible for luminescence activity. lipids modulate macrophage and FEBS Lett. 583, 2004-2008. dendritic responses that favor development of experimental 38. Robinson, S.L., Christenson, J.K., and actinomycetoma in BALB/c mice. Wackett, L.P. (2019) Biosynthesis and Infect. Immun. 80, 3587-3601. chemical diversity of β-lactone natural products. Nat. Prod. Rep. 36, 458-475. 45. Beaman, B.L. (1996) Differential binding of Nocardia asteroides in 39. Wang, Y., Tennyson, R.L., and Romo, the murine lung and brain suggests D. (2004) β-Lactones: intermediates multiple ligands on the nocardial for natural product total synthesis and surface. Infect. Immun. 64, 4859- new transformations. Heterocycles. 4862. 64, 605-658. 46. Blin, K., Pascal Andreu, V., de los 40. Allemann, M.N., Shulse, C.N., and Santos, E.L.C., Del Carratore, F., Allen, E.E. (2019) Linkage of marine Lee, S.Y., Medema, M.H. and bacteria polyunsaturated fatty acid and Weber, T. (2019) The antiSMASH long-chain hydrocarbon biosynthesis. database version 2: a comprehensive Front. Microbiol. 10, 702. resource on secondary metabolite biosynthetic gene clusters. Nucleic 41. Komaki, H., Ichikawa, N., Hosoyama, Acids Res. 47, D625-D630. A., Takahashi-Nakaguchi, A., Matsuzawa, T., Suzuki, K.I., Fujita, 47. Fu, L., Niu, B., Zhu, Z., Wu, S. and N. and Gonoi, T. (2014) Genome Li, W. (2012) CD-HIT: accelerated based analysis of type-I polyketide for clustering the next-generation synthase and nonribosomal peptide sequencing data. Bioinformatics. 28, synthetase gene clusters in seven 3150-3152. strains of five representative Nocardia species. BMC Genomics. 15, 11. 48. Csardi, G., and Nepusz, T. (2006) The igraph software package for 42. Pidot, S.J., Herisse, M., Sharkey, L., complex network research. Atkin, L., Porter, J.L., Seemann, T., InterJournal. 1695, 1-9. Howden, B.P., Rizzacasa, M.A. and Stinear, T.P. (2019) Biosynthesis and 49. Eddy, S.R. (2011) Accelerated ether-bridge formation in nargenicin profile HMM searches. PLOS Comp. Biol. 7, e1002195.

15 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

50. Malley, J.D., Kruppa, J., Dasgupta, A., Malley, K.G. and Ziegler, A. (2012) Probability machines. Methods of Information in Medicine. 51, 74-81.

51. Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 30, 1312-1313.

52. Bradford, M.M. (1976) A rapid and sensitive method for quantitation of microgram quantities of protein utilizing the principle of protein-dye binding. Anal. Biochem. 72, 248-254.

53. Mulzer, J., Brüntrup, G., Hartz, G., Kühl, U., Blaschek, U., and Böhrer, G. (1981) Additionen von Carbonsäure-Dianionen an α, β- ungesättigte Carbonylverbindungen- Steuerung der 1, 2-/1, 4- Regioselektivität durch sterische Substituenteneffekte. Chemische Berichte. 114, 3701-3724.

16 bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1.00 0.75 0.50

Sensitivity 0.25 0.00 0.00 0.25 0.50 0.75 1.00 1 Specificity

bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

H O N S PCP

O NH2 O NH2 O HS S PCP O S PCP NH2 O S-CoA

O HO O S-CoA S-CoA O S-CoA

O S-CoA OH 0.4 O S-CoA 0.4

O * N O S S S-CoA N HO

O O S-CoA O-AMP HO O

S-CoA

O O S-CoA S-CoA

bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. O

S-CoA O HO O S-CoA

S-CoA O O S-CoA S-CoA O

O

O

S-CoA

OH O S-CoA

HO O O S-CoA HS S PCP O NH2 O-AMP O H N S PCP NH2 NH2 10000 S PCP O Aryl−CoA ligases (ARYL) Beta−lactone synthetases (BLS) Fatty acyl−AMP ligases (FAAL) 100 Long chain acyl−CoA synthetases (LACS) Luciferases (LUC) Medium chain acyl−CoA synthetases (MACS) Nonribosomal peptide synthases (NRPS)

Log10 # of genes predicted Short chain acyl−CoA synthetases (SACS)

1 Very−long chain acyl−CoA synthetases (VLACS) NRPS ARYL FAAL SACS MACS LACS VLACS BLS Enzyme function

bioRxiv preprint doi: https://doi.org/10.1101/856955; this version posted March 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

O O

H N O O OO O Nocardia lijiangensis Nocardia vinacea Nocardia farcinica Nocardia gamkensis Nocardia niwae Nocardia beijingensis Nocardia asiatica Nocardia abscessus Nocardia higoensis Nocardia brasiliensis Nocardia vulneris Nocardia xishanensis Nocardia salmonicida Nocardia soli Nocardia alba Nocardia cyriacigeorgica Nocardia niigatensis Nocardia yamanashiensis Nocardia inohanensis Nocardia pseudobrasiliensis Nocardia transvalensis Nocardia africana Nocardia elegans Nocardia cerradoensis Nocardia vermiculata 2 kb hypothetical protein Short chain dehydrogenase NltA/NltB Cytochrome p450 NltC 4HBT Thioesterase superfamily NltD Galactokinase galactose−binding signature MMPL transporter Galactose−1−phosphate uridyl transferase TetR family regulator Zinc−binding dehydrogenase OleA NltA + NltB NltA NltB ) 2 R O min ( O 1 time

R i AMP PP NltC Retention 21.9 22.0 22.1 22.2 ATP OH

50000 50000 50000 50000

O

150000 100000 150000 100000 150000 100000 150000 100000 FID Peak Area Peak FID − GC 2 R The copyright holder for this preprint (which was not was (which preprint this for holder copyright The OH 1 + R C14 C20 Control NltD NAD(P)H NADP OH O 2 this version posted March 19, 2020. 2020. 19, March posted version this R ; O 1 R O OH Time (hours) 2 CoASH HO O OH NltA O 2 HO H https://doi.org/10.1101/856955 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. permission. without allowed No reuse reserved. rights All author/funder. is the review) by peer certified doi: doi: 0510152025 S-CoA S-CoA O O 2 0 1 R 25 50 75

R

100 bioRxiv preprint preprint bioRxiv AMP produced (µM) produced AMP