BSc. A. Roeters - Probing the biosynthetic diversity of actinobacteria 29-01-2018

PROBING THE BIOSYNTHETIC DIVERSITY OF ACTINOBACTERIA

MSc. Thesis by Arne Roeters, supervised by dr. MH Medema and JC Navarro Munoz PhD. Bioinformatics department Wageningen university.

ABSTRACT

The Actinobacteria are a large phylum of Gram-positive bacteria from which we harvest many clinically useful natural products. A large portion of these clinically useful products are made by the largest genus within this phylum, Streptomyces. These products are made by biosynthetic gene clusters (BGCs), which are physically clustered genes on the genome. To find more of these natural compounds, genome mining has become one of the most important tools in bioinformatics. This new technique has given rise to programs like antiSMASH (Medema, et al., 2011). Programs like this have created new challenges due to the large number of BGCs they mine. To narrow the search for new interesting natural products, BiG-SCAPE was created. This program clusters BGCs together to find novel BGCs that might be variants of already existing clinically useful products. However, this program can still be improved by optimizing the parameters it uses. The goal of this project is to optimize these parameters and use them to find interesting BGC patterns in the phylogeny of Streptomyces.

INTRODUCTION

Actinobacteria are a very large phylum of Gram-positive bacteria that are widespread across terrestrial and aquatic ecosystems. They are primarily chemoheterotrophs (i.e. they get their energy from ingesting organic compounds) that have varying tolerances of oxygen levels, from strictly aerobic to strictly anaerobic. In soil systems of forests and agricultural land they are of great importance since they help decompose rotten and dead material, so that it can be taken up by plants. Some species even have symbiotic relations with certain plants. In these relationships the actinobacteria provide nitrogen to the plant and in return they take some of the plant's saccharide reserves [1,2]. Maybe even more important and interesting about these bacteria are their secondary metabolites that can be used for medical purposes [3]. A large part of the clinically available antibiotics come from Actinobacteria, and especially from the largest genus, Streptomyces. This genus produces over two-thirds of the clinically useful natural antibiotics with its natural product biosynthetic gene clusters [4]. Not nearly all natural compounds have been found yet, meaning that there might still be many more useful compounds that are made by the biosynthetic pathways of Actinobacteria. These biosynthetic pathways consist of genes that are physically clustered together on the chromosome, forming so-called biosynthetic gene clusters (BGCs) [5–7]. These BGCs create a vast number of metabolites within the organism in which they are present. There are two types of metabolites: the primary metabolites, which are not produced by BGCs, and the secondary metabolites, which are produced by BGCs. Primary metabolites are used by the organism to grow, develop, survive and reproduce. Secondary metabolites are not directly involved in these processes and are not necessary for the organism to survive, although a lack of these metabolites might be harmful for the organism in the long run, as they help it defend against other microbes [4]. Many of these secondary metabolites are used by humans for other purposes (e.g. caffeine is produced by some plants to protect them against bugs, since it is poisonous in high concentrations, and to prevent germination of other plants in the area). These metabolites are synthesized in many ways and are therefore also classified differently. The classes that are used in this study are polyketides (“assembly line”-like synthesis with a low number of building blocks), Non-Ribosomal Peptides (NRPs, synthesized in almost the same way as polyketides but have non-proteinogenic amino acids as the building


blocks) [8], Ribosomally synthesized and Post-translationally modified Peptides (RiPPs, contain a gene coding for a single core peptide and multiple other genes coding for modification enzymes), Terpenes (use isoprene as building blocks), Saccharides (part of the carbohydrate synthesis), Polyketide/NRP hybrids [9] (PKS-NRP) and Other (all metabolites that cannot be classified in one of the other classes, e.g. alkaloids [10]). Within these classes, the metabolites are divided further into groups to be even more specific. For instance, the class polyketide contains the groups antifungal polyenes, macrolides (14-membered), macrolides (16-membered), statins and many more. These groups divide the metabolites further according to the way they are built (table 1).

Since the rise of rapid and cheap DNA sequencing techniques many new BGCs have been found. Not all these BGCs have been linked to the metabolites they create. This means that there might be many more clinically useful metabolites that await discovery [11]. Recently, genome mining of these BGCs became vital for finding new molecules, which led to numerous new compounds. To mine all these BGCs, multiple tools have been developed (e.g. ClustScan for NRPS and PKS [12], BAGEL3 [13] for RiPPs and antiSMASH [14] for all classes) to help scientists in this mission of finding molecules that could be used in medicine. The problem here is that there are so many novel BGCs that it becomes difficult to know where to start. This can be simplified by finding BGCs that show similarities to known BGCs of which the products are already clinically useful. In the end this might help in finding novel varieties of already existing drugs or novel interesting natural products to which other microorganisms are not yet resistant.

Class         Nr. of groups   Nr. of members
NRPS          24              131
Other         4               24
PKS-NRP       5               17
PKSI          36              142
PKSother      1               10
RiPPs         12              36
Saccharides   3               10
Terpenes      4               19

Table 1: Shows how many known groups and compounds there were in the test set. This shows that not all compounds that are in the MIBiG have assigned groups.

BGC MINING AND CLUSTERING

In this research project antiSMASH is used to find gene clusters that result in secondary metabolites. AntiSMASH predicts gene clusters based on the Pfam [15] domains that are present and are associated with biosynthetic genes. The newest version of antiSMASH will be used in the project with and without the newly incorporated ClusterFinder algorithm. The ClusterFinder algorithm is a BGC prediction pipeline which first annotates the genome sequence and converts it to a string of Pfam domains. It then calculates the probability of each domain being part of a BGC based on two conditions: the first is the frequency of the domain in the BGC and non-BGC training sets, and the second is the identities of the neighboring domains. The last step is to cluster genes that contain Pfam domains where the probability of a BGC hidden state is above a certain threshold. After antiSMASH, BiG-SCAPE [8] is used to define a distance matrix between the found domains. BiG-SCAPE was developed to create a distance matrix using three different components: the Jaccard index (JI), the domain similarity score (DSS) and the Adjacency Index (AI). Since a BGC is a collection of genes that together encode a set of proteins that will eventually synthesize the compound, the distances are based on the domains in the BGCs. The JI is the ratio between distinct shared and distinct unshared domains of a pair of BGCs. The DSS measures the sequence similarity between domains and is normalized for the difference in copy number of the domains. For the calculation of the DSS, the key domains (here called anchor domains) and less important domains (non-anchor domains) are taken into account separately. The anchor domains can be virtually multiplied by a factor called the anchor boost. This makes the anchor domains weigh more heavily in the calculation of the DSS. The AI is the ratio between the shared adjacent domain-pairs and the unshared domain-pairs, without taking the order of the domain-pairs into account. The final distance that is used in the distance matrix combines the three similarity measures using formula 1.

$$D = 1 - (a \cdot JI_{pq} + b \cdot DSS_{pq} + c \cdot AI_{pq})$$

Formula 1: The distance calculation used by BiG-SCAPE. Here D is the final distance between BGCs p and q. a, b and c are the weights that are given to the JI, DSS and AI respectively, where a + b + c = 1.

Based on this distance matrix, BiG-SCAPE creates a network of all BGCs which contains all Gene Cluster Families (GCFs). These GCFs contain all BGCs that have a similar structure in Pfam domains which


would result in a similar metabolite. These networks are created by BiG-SCAPE, which uses a clustering algorithm called affinity propagation [16] (AP), which unlike k-means clustering does not need the number of clusters to be predefined. This algorithm uses a concept called message passing to calculate the final clusters. These messages are sent between neighboring data points until the clusters are resolved and all data points either are, or are linked to, so-called exemplars (a representative for the data point itself and the data points that are linked to it). There are two types of messages, the availability and the responsibility. The availability is sent from the data point to a neighboring data point, which in turn sends back its responsibility. Sometimes the messages overshoot their solution, and therefore the updates can be damped to resolve this problem. In this project we aim to optimize the BiG-SCAPE parameters and use these optimized parameters to study the BGC dispersion in Streptomyces. This might contribute to a better understanding of horizontal BGC transfer between species and maybe even between different genera. Optimizing these parameters can also be important for linking novel secondary metabolites to their BGCs. This might help in prioritizing BGCs with interesting products in future research.

METHODS AND IMPLEMENTATION

INITIAL BiG-SCAPE RUN AND DATA

As mentioned before, the weights in BiG-SCAPE are important for the clustering of the BGCs in Gene Cluster Families. Each class of metabolites might need a different weight distribution between the JI, DSS and AI to reach the optimal results. The weights that BiG-SCAPE currently uses by default were determined by calculating the correlation between the BGC distance and the compound distance. To test if these weights are optimal for clustering, three different ways of scoring the clustering were created: 1) scoring the clusters created by BiG-SCAPE based on the groups of the nodes and the links between them, 2) calculating an F-score for the networks and 3) calculating the network connectivity. The data that was used for this came from the manually curated MIBiG (Minimum information about a biosynthetic gene cluster) [17] database, which is filled with about 1500 known BGCs that are collected from papers. Since most information about the gene clusters is known from literature, including the chemical structures of the products, the data is well suited for optimizing the weights for the Jaccard index, DSS and AI. To achieve this goal, network files had to be created by BiG-SCAPE using all entries in the database as input. Further options used were “--hybrids”, to also obtain networks from hybrid clusters like PKS-NRP, “--mix”, to get network files that have all classes together, and “--mode lcs”, to only use the longest slice mode, which is a glocal alignment between two BGCs. In this alignment mode, the longest common substring is extended to both sides with matches and mismatches until the BGCs do not overlap anymore. A smaller subset of the domains in the BGCs will make the JI, DSS and AI increase. After the initial run of BiG-SCAPE, distances and GCFs can be recalculated with different weights using the information in the resulting network files from BiG-SCAPE. The weights for the JI, DSS and AI were changed in steps of 0.05 to lower the computational intensity of the recalculations and AP. The range for the anchor boost was 1.0 to 4.0 with a step of 1.0 each time.

GROUP BASED NETWORK SCORING

For this first method of scoring the networks, the group of the nodes and the edges between them are considered the important part of the networks. The edges are divided into three types which take the group of the nodes into account: ‘within-edges’, ‘between-edges’ and ‘unknown-edges’. Within-edges are links between BGCs from the same group, whereas between-edges are links between BGCs from different groups. Finally, the unknown-edges are links between BGCs where one or both groups are unknown. Each edge in the network gets a score using the similarity (1 - D) between two BGCs. Within-edges add 1 + similarity to the total score, since these edges are between nodes of the same group, which contributes to well separated clusters. Unknown-edges add their similarity, because it is not known if the nodes are from the same group, so no bonus or penalty is given to these edges. Finally, the between-edges are penalized, so they subtract 1 + similarity. Edges between nodes that are not from the same group would lead to clusters that are poorly separated and are therefore penalized. The score is then normalized for the number of edges to prevent a bias towards larger networks with more edges. This results in a score for the network between -2 and 2. A score close to -2 represents a poorly separated network and a score close to 2 represents a well separated network with edges almost solely between BGCs within the same group. By dividing


the total score by the number of edges, the score is normalized for the number of edges (workflow figure 1). Therefore, we can assume that the edges that do not contribute to the network will lower the score and therefore be eliminated from the best networks. The idea behind this is that the best networks will only contain pure clusters and thus mostly contain edges between BGCs from the same group. During this scoring method, two default implementations of AP were tested: 1) Scikit-learn [18], which uses a full distance matrix, and 2) pysapc [19], which uses a sparse matrix and is therefore computationally less intensive at the cost of some precision. These default settings included a preference for the cluster size and were “median” for Scikit-learn and “min” for pysapc. Here a preference for “median” creates more clusters than the “min” option would. To be able to compare these two implementations, a third AP was added, which was the pysapc implementation with the preference set to “median”. The results were analyzed in R using graphs to show how they compared and how they behaved in the different classes of metabolites at different distance cutoff levels. Later analysis showed that there was no big difference between the two versions, so only the currently used pysapc implementation with default settings is used in the following steps.

PURITY AND COMPLETENESS BASED SCORING

In order to find out which weights give the purest and most complete network, a scoring method was also based on these attributes. Since most BGCs have a known group, it was possible to calculate the purity and completeness of the formed clusters. This was done by changing the BGC ids to their respective group and loading in the AP files. For every group that was present, only the cluster with the most members of that group was used to calculate the purity and completeness. The first step of calculating these parameters is to find the cluster where group “A” (“A” being one of the curated groups in the dataset) is most abundant. In this cluster, the number of BGCs of group “A” is divided by the total number of BGCs present in that cluster to calculate the purity. To calculate the completeness, the number of BGCs of group “A” present in this cluster is divided by the total number of BGCs in group “A” (figure 2). The score for a single cluster consists of the completeness and purity together (formula 2). This creates a score between 0 and 1 for each cluster.

$$\mathrm{Score}_m = 0.5 \cdot \mathrm{purity}_m + 0.5 \cdot \mathrm{completeness}_m$$

Formula 2: The score per cluster m, where m is the cluster in which the most members of the group are located.

To get the score of a complete network, the scores of all clusters in that network are added (workflow figure 3). The weights and distance cutoff that were used for the network with the highest score create the purest and most complete network possible.
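The purity and completeness scoring described above can be illustrated with a minimal Python sketch. It is not the script used in this project: the function name `network_score`, the dictionary inputs and the toy data are assumptions made for illustration. The toy data follow the figure 2 example, where a cluster of 6 BGCs holds 5 of the 8 members of group A.

```python
from collections import Counter, defaultdict


def network_score(cluster_of, group_of, w_purity=0.5, w_completeness=0.5):
    """Sum the per-group scores of formula 2 over all curated groups.

    cluster_of: dict mapping BGC id -> cluster label (hypothetical AP output)
    group_of:   dict mapping BGC id -> curated group (BGCs without a group are left out)
    """
    # Size of every cluster, counting all BGCs in it.
    cluster_sizes = Counter(cluster_of.values())

    # For every group, count how many of its members sit in each cluster.
    members = defaultdict(Counter)
    for bgc, group in group_of.items():
        members[group][cluster_of[bgc]] += 1

    per_group = {}
    for group, counts in members.items():
        # Only the cluster holding most members of the group is scored.
        best_cluster, n_in_cluster = counts.most_common(1)[0]
        purity = n_in_cluster / cluster_sizes[best_cluster]
        completeness = n_in_cluster / sum(counts.values())
        per_group[group] = w_purity * purity + w_completeness * completeness
    return sum(per_group.values()), per_group


# Toy data: group A has 8 members, 5 of them in cluster 2 (which has 6 BGCs).
cluster_of = {"a1": 1, "a2": 1,
              "a3": 2, "a4": 2, "a5": 2, "a6": 2, "a7": 2, "b1": 2,
              "a8": 3, "b2": 3, "b3": 3}
group_of = {bgc: ("A" if bgc.startswith("a") else "B") for bgc in cluster_of}

total, per_group = network_score(cluster_of, group_of)
print(per_group["A"])  # 0.5 * 5/6 + 0.5 * 5/8
print(total)
```

In this sketch a group of two BGCs that forms its own cluster would contribute a full 1.0 to the network score, regardless of how small the group is.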

Figure 1: The flow of the scoring algorithm for a single network for the group based scoring. The script does this for all different cutoff values and weights that are tested.

Figure 2: In this figure the purity and completeness of group A is calculated based on cluster 2, since this cluster contains the largest part of group A. Here the purity would be 5/6 because the cluster has 6


members and the completeness 5/8 because in total group A has 8 members.

Figure 3: The workflow of the purity and completeness combined scoring algorithm.

NETWORK CONNECTIVITY

The last aspect that was investigated was the network connectivity. In order to test if the clustering could also be done without the AP, the connected components were calculated as the GCFs instead of the AP generated GCFs. This was done to show how the number of clusters would differ with shifting weights and distance cutoffs without relying on AP. The NetworkX [20] package was used to create the network connectivity for each network. The networkx.Graph() constructor creates an empty network to which one can add nodes and edges. Nodes were added if the distance between them was smaller than the cutoff value for that network. The NetworkX package then adds the nodes with an edge between them and keeps track of what is connected. Once all the edges for a certain weight and cutoff have been parsed, the connected components can be retrieved from the graph using the connected_components() function. From these results the number of clusters, the largest cluster size, the smallest cluster size and the number of singletons were reported.

BGC MINING AND PHYLOGENY OF STREPTOMYCES

The next part of this project was intended to try the new-found weights on a real dataset to find interesting BGCs based on their dispersion across the Streptomyces phylogeny. This was done by gathering 99 genomes from the NCBI genome database. 97 of these genomes were from Streptomyces, while Salinispora arenicola and Catenulispora acidiphila were chosen as the outgroups (list of used genomes, including plasmids, in supplementary materials). Catenulispora and Streptomyces have a common ancestor which is relatively close by, while Salinispora has a more distant common ancestor [21]. From these 99 genomes a phylogenetic tree was constructed using two different methods: 1) based on 400 conserved proteins using PhyloPhlAn [22] and 2) based on the RNA polymerase subunit beta (rpoB). This single-copy gene is highly conserved and therefore fit to base a phylogeny on; this has already been done successfully within the field of clinical microbiology [23]. After the phylogenetic trees were made, the tree based on rpoB was chosen based on how the trees looked and on a comparison of the phylogeny with the phylogeny in literature [24,25]. The next step was to extract all BGCs from the genomes using antiSMASH. The settings used were:

- --smcogs
- --borderpredict

Smcogs was enabled to include the secondary metabolism clusters of orthologous groups as well, to increase the chance of finding BGCs. The borderpredict option enabled the ClusterFinder algorithm for better BGC border detection. The BGCs that were found by antiSMASH were then clustered using BiG-SCAPE to find out how they would form GCFs. This clustering step also included the BGCs in the MIBiG database, to show which known BGCs were present in the genomes. The options used for BiG-SCAPE were:

- --hybrids
- --mode lcs

Hybrids was used to add hybrid classes like a Terpene-NRPS to both the Terpene and NRPS classes, where they would otherwise be classified as Other. The output of BiG-SCAPE and the phylogenetic tree were combined, using etetoolkit [26] and PyQt4 [27], to show how the BGCs and GCFs were spread among


the species. In this visualization every GCF gets its own distinct color for easy recognition of different GCFs. The visualization offers two ways of showing the information: one where all BGCs are shown and one where only the BGCs that matched a MIBiG cluster are shown, so that they can be identified. To make it even clearer in which species the BGCs are present, this tool also creates a file, for the BGCs that cluster to a MIBiG reference BGC, which one can use to make an absence presence matrix of known BGCs. This filtering step was done to lower the number of BGCs. The matrix was imported in R and combined with the phylogenetic tree using the “ape” and “gplots” packages. To see patterns across the tree more clearly, a threshold of 4 species was set, in which a BGC should be present to be shown. To verify some of these clusters, antiSMASH was run once more, now using “--knownclusterblast” and “--minimal” to do a quick search in the genomes for the BGCs and identify them against the MIBiG database. This was done to make sure that no BGCs were missing in the matrix. The e-values for the top hits in the knownclusterblast output were also very small, which meant that the highest cutoff could be used.

RESULTS AND DISCUSSION

GROUP BASED, AND COMPLETENESS-PURITY BASED SCORING RESULTS

The group-based scoring showed that of the two AP implementations Scikit-learn performed better when the default settings were used, although for most classes the scores were still comparable (table 2). This behavior was expected, since the pysapc implementation uses a sparse matrix, which causes some loss in accuracy and therefore lower scores. When the pysapc implementation is used with the same settings as the Scikit-learn implementation, on the other hand, it performs much worse than the other methods. It is not clear why this AP performs so much worse, but it does mean that it can be disregarded as a candidate for a possible different method of AP. What this table also shows is that some classes can be separated more easily than other classes. For instance, the class saccharides shows a relatively high score in the Scikit-learn AP, which can be due to two reasons. Firstly, it might be the case that the groups in this class differ more from each other and are therefore separated more easily compared to the other classes, or secondly that the number of different groups in the class saccharides is too small (table 1). This second option seems to be the case when the weights with the highest scores are compared (table 3). Here we see that the DSS is very important for the Saccharides, which means that the domains within the saccharides are probably very similar to each other and that the differences are based more on the difference in sequence. Notice here that even in the top scoring results there is already a fluctuation of 0.25 for the DSS within the same distance cutoff. This means that there is no single optimum weight that will always lead to the best results, but rather a range which leads to the best results. The thing that stands out in table 3 is that it does show a tendency for which the best results will be obtained. This is an example of the saccharides class using the Scikit implementation, but this is shown in all classes and both implementations of the AP.

Class         pysapc    Scikit-learn   pysapc
preference    minimum   median         median
All-mix       1.083     1.142          0.446
NRPS          1.147     1.178          0.435
Other         0.561     0.973          0.284
PKS-NRP       1.155     1.247          0.387
PKSI          1.132     1.154          0.453
PKSother      0.977     1.006          0.531
RiPPs         0.904     1.031          0.393
Saccharides   0.976     1.399          0.458
Terpenes      0.964     1.116          0.397

Table 2: This table shows the highest scores for each of the classes in BiG-SCAPE for the three methods of AP. The table does not show the weights; these can be different for all three methods.
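For reference, the group-based score reported above follows the edge rules from the methods: within-edges add 1 + similarity, unknown-edges add their similarity, between-edges subtract 1 + similarity, and the total is divided by the number of edges. A minimal Python sketch of these rules is given here. It is an illustration only: the function name `group_based_score`, the edge tuples and the group names in the toy data are assumptions and are not taken from the scripts of this project.

```python
def group_based_score(edges, group_of):
    """Score one network with the within / between / unknown edge rules.

    edges:    iterable of (bgc_a, bgc_b, distance) tuples below the cutoff
    group_of: dict mapping BGC id -> curated group; missing ids or None
              mean that the group is unknown
    """
    total = 0.0
    n_edges = 0
    for bgc_a, bgc_b, distance in edges:
        similarity = 1.0 - distance
        group_a = group_of.get(bgc_a)
        group_b = group_of.get(bgc_b)
        if group_a is None or group_b is None:
            total += similarity            # unknown-edge: no bonus or penalty
        elif group_a == group_b:
            total += 1.0 + similarity      # within-edge: rewarded
        else:
            total -= 1.0 + similarity      # between-edge: penalized
        n_edges += 1
    # Normalize for the number of edges, giving a score between -2 and 2.
    return total / n_edges if n_edges else 0.0


# Toy network with one edge of each type (hypothetical ids and groups).
group_of = {"bgc1": "macrolide", "bgc2": "macrolide", "bgc3": "statin"}
edges = [("bgc1", "bgc2", 0.2),   # within-edge
         ("bgc2", "bgc3", 0.5),   # between-edge
         ("bgc3", "bgc4", 0.4)]   # unknown-edge, bgc4 has no group
score = group_based_score(edges, group_of)
print(round(score, 3))
```

A network with only within-edges at distance 0 would reach the maximum of 2, and a network with only between-edges at distance 0 the minimum of -2.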


JI     AI     DSS    Cutoff   Score
0.05   0.05   0.90   0.20     1.399
0.05   0.00   0.95   0.20     1.399
0.05   0.15   0.80   0.20     1.399
0.05   0.10   0.85   0.20     1.399
0.05   0.25   0.70   0.20     1.399
0.05   0.25   0.70   0.40     1.357
0.05   0.20   0.75   0.40     1.357

Table 3: This shows the top scoring weights at the given cutoff for the Scikit implementation for the Saccharides class.

This last remark is also confirmed by the scores that we obtain from the purity and completeness-based scoring (figure 4). This image shows that there are different weights that create similar results. The downside of this scoring method is that it is not normalized for the size of the groups. So, we cannot show with this data if there are single weights that create a few very pure clusters or if there are multiple smaller groups that contribute a lot to the final network score, since a group with just two members that is clustered together adds close to 1 whole point to the score, whereas a group with 6 members of which 4 cluster together will not come close to that. To improve this scoring method, group size should be taken into account.

Figure 4: Here the purity-based scoring is shown for all 8 classes in BiG-SCAPE and one for the mix of all classes. Here the scores are sorted from highest to lowest score per cutoff value. This is the reason that the graphs show the stairs-like pattern.

NETWORK CONNECTIVITY RESULTS

The last step of optimizing BiG-SCAPE was to see what the GCFs would look like when AP was not applied at all, but instead the connected components were taken as the GCFs. The results show that when AP is not applied, the clusters tend to be very large with a high distance cutoff and rapidly go down in size until they reach the point of being almost solely singletons (figure 4).

Figure 4: This figure shows the number of singletons compared to the biggest cluster size. All classes are included and therefore the singletons and biggest cluster size are both normalized for the total number of BGCs present in that class by dividing them by that number of BGCs.

This figure shows that when the number of singletons is low, almost everything clusters into a single large cluster (upper left corner, figure 4). The opposite is the case when the largest cluster is small: in that case, almost everything is a singleton (lower right corner, figure 4). There are almost no cases where multiple smaller clusters are formed. If the clustering was successful we would expect more of a cloud in the lower left corner of figure 4. This would indicate that many smaller clusters were formed. This indicates that the AP is necessary to cluster the BGCs into GCFs.

RPOB BASED PHYLOGENY OF STREPTOMYCES

The rpoB phylogenetic tree of the 99 genomes (figure A) was built and validated using trees in literature [24,25]. These trees showed similarities with the rpoB tree. Even though not all species could be found in the trees, there were enough similarities to validate the tree. The fact that the tree showed a very low distance between different strains of the same species and the outgroups were very well


separated from the Streptomyces confirmed once more that the tree was well built. Of these outgroups, Catenulispora was most closely related to Streptomyces, which was also shown in literature [21]. The phylogenetic tree made by PhyloPhlAn showed some strange results by not grouping Streptomyces of the same species together but rather dispersing them over the complete tree.

Figure 5: This figure shows how the clusters are dispersed across the phylogenetic tree. The columns give the compound names; the names of the species are easier to read in the large phylogenetic tree (figure A). All clusters that were present in less than 5 species are left out of the matrices to show clearer patterns across the phylogeny.

STREPTOMYCES BGCS

Genome mining of the 99 genomes with antiSMASH found several clusters in each genome ranging from


about 20 clusters for the smaller genomes to about 45 for the larger genomes. When these mined BGCs were clustered with MIBiG BGCs using BiG-SCAPE, many of these BGCs did not cluster with any MIBiG cluster. For the analysis of the Streptomyces phylogeny and their BGCs, only the BGCs that clustered with MIBiG clusters could be used. After this filtering step, some species were left with only two BGCs, whereas the species with the largest number of BGCs still had fifteen left. This was with the cutoff of 0.80, due to the fact that the e-values in the knownclusterblast were very low. Now that the BGCs were clustered to the MIBiG clusters and an absence presence matrix was made for all species as in figure 5, some very interesting patterns in the BGCs became visible. One of the interesting clades that showed a lot of uniformity with their neighboring species was the one that contained the species S. alboreticuli, S. malaysiensis, S. autolyticus, S. binchenggensis and S. violaceusniger. These strains have almost all BGCs in common with each other. This happens multiple times in the tree, which supports that the tree is well built. Another interesting pair of BGCs shows up in the upper matrix of figure 5. Here it seems that BGC0000663 is almost always accompanied by either BGC0000660 or a BGC that clustered to BGC0000661 and BGC0001181. What is known about all four of these BGCs is that they produce a compound from the class terpene and that they were found before in Streptomyces [28]. Now these clusters are also found by antiSMASH in Catenulispora.

CONCLUSIONS

At the beginning of the project it was already clear that it would be unlikely to find a single weight distribution for the JI, DSS and AI that could be used in BiG-SCAPE to cluster all classes of BGCs into pure and complete GCFs. This became more and more clear after the optimization steps showed even more challenges. They showed that the weights do not only have to be fine-tuned for every class, but also for every distance cutoff. Even though there was a general trend in most of the weights of the JI, AI and DSS, a large fluctuation in weights was still present (table 3), even within the same distance cutoff. This was confirmed once more by the purity and completeness-based optimization, due to the fact that it had many scores that were very similar at each distance cutoff level. The purity-based scoring method still has room for improvement by taking the group sizes into account as well. As it is now, small groups benefit the purity and completeness score more than bigger, almost complete clusters. The scoring methods, group based and network connectivity, showed that AP is necessary for finding well defined GCFs with BiG-SCAPE. After testing 3 ways of AP (pysapc, Scikit-learn, pysapc with preference median), the scores in the group-based scoring showed that the computationally more intensive Scikit-learn implementation of the AP scored best in all classes. Although the default implementation of pysapc had comparable results in some classes (table 2, PKSI), in others it performed considerably worse (table 2, Other and Saccharides). The recommendation is to change the pysapc implementation to the Scikit-learn implementation for overall better results, if the computational problems can be solved. Unfortunately, the optimization did not work well enough to try out new weights for the second part of this project. Nevertheless, the second part already gave a proof of concept by showing some interesting patterns in the BGCs and some BGCs that were widespread across the phylogeny. The clades in the phylogeny showed many resemblances with each other, which shows that the tree based on rpoB was well built. For further research, 2D clustering on the absence presence matrices can elucidate more patterns that are not obvious right away. For future research it might also be interesting to add other genera to the tree, to see if the BGCs that are widespread within Streptomyces are also found in genomes of other genera. By also considering the environment where the bacteria were found, one can show what drives horizontal BGC transfer.

ACKNOWLEDGEMENTS

I would like to thank Marnix Medema and Jorge Navarro Munoz for their supervision and help. I would also like to thank Nelly Selem for helping with the phylogeny of Streptomyces.


REFERENCES

1. Atta, H. M. & Ahmad, M. S. Antimycin-A antibiotic biosynthesis produced by Streptomyces Sp. AZ-AR-262: , fermentation, purification and biological activities. Aust. J. Basic Appl. Sci. 3, 126–135 (2009).
2. Bressan, W. Biological control of maize seed pathogenic fungi by use of actinomycetes. BioControl 48, 233–240 (2003).
3. Servin, J. A., Herbold, C. W., Skophammer, R. G. & Lake, J. A. Evidence excluding the root of the tree of life from the Actinobacteria. Mol. Biol. Evol. 25, 1–4 (2008).
4. Manivasagan, P., Venkatesan, J., Sivakumar, K. & Kim, S. K. Pharmaceutically active secondary metabolites of marine actinobacteria. Microbiological Research 169, 262–278 (2014).
5. Doroghazi, J. R. et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nat. Chem. Biol. 10, 963–968 (2014).
6. Medema, M. H. et al. AntiSMASH: Rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, (2011).
7. Medema, M. H. & Fischbach, M. A. Computational approaches to natural product discovery. Nature Chemical Biology 11, 639–648 (2015).
8. Yeong, M. BiG-SCAPE: exploring biosynthetic diversity through gene cluster similarity networks. (WUR, 2016).
9. Leloir, L. F. Two decades of research on the biosynthesis of saccharides. Science 172, 1299–1303 (1971).
10. Facchini, P. J. Alkaloid biosynthesis in plants: Biochemistry, Cell Biology, Molecular Regulation, and Metabolic Engineering Applications. Annu. Rev. Plant Physiol. Plant Mol. Biol. 52, 29–66 (2001).
11. Schorn, M. A. et al. Sequencing rare marine actinomycete genomes reveals high density of unique natural product biosynthetic gene clusters. Microbiology (United Kingdom) 162, 2075–2086 (2016).
12. Starcevic, A. et al. ClustScan: An integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res. 36, 6882–6892 (2008).
13. van Heel, A. J., de Jong, A., Montalbán-López, M., Kok, J. & Kuipers, O. P. BAGEL3: Automated identification of genes encoding bacteriocins and (non-)bactericidal posttranslationally modified peptides. Nucleic Acids Res. 41, (2013).
14. Weber, T. et al. AntiSMASH 3.0 - A comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res. 43, W237–W243 (2015).
15. Finn, R. D. et al. Pfam: The protein families database. Nucleic Acids Research 42, (2014).
16. Frey, B. J. & Dueck, D. Clustering by passing messages between data points. Science 315, 972–976 (2007).
17. Medema, M. H. et al. Minimum Information about a Biosynthetic Gene cluster. Nature Chemical Biology 11, 625–631 (2015).
18. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2012).
19. Cao, H. & Amendt, B. A. pySAPC, a python package for sparse affinity propagation clustering: Application to odontogenesis whole genome time series gene-expression data. Biochim. Biophys. Acta - Gen. Subj. 1860, 2613–2618 (2016).
20. Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. in Proceedings of the 7th Python in Science Conference (SciPy2008) 836, 11–15 (2008).
21. Barka, E. A. et al. Taxonomy, Physiology, and Natural Products of Actinobacteria. Microbiol. Mol. Biol. Rev. 80, 1–43 (2016).
22. Segata, N., Börnigen, D., Morgan, X. C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, (2013).
23. Adékambi, T., Drancourt, M. & Raoult, D. The rpoB gene as a tool for clinical microbiologists. Trends in Microbiology 17, 37–45 (2009).
24. Zhou, Z., Gu, J., Du, Y.-L., Li, Y.-Q. & Wang, Y. The -omics Era- Toward a Systems-Level Understanding of Streptomyces. Curr. Genomics 12, 404–416 (2011).
25. Labeda, D. P., Doroghazi, J. R., Ju, K. S. & Metcalf, W. W. Taxonomic evaluation of Streptomyces albus and related species using multilocus sequence analysis and proposals to emend the description of Streptomyces albus and describe Streptomyces pathocidini sp. nov. Int. J. Syst. Evol. Microbiol. 64, 894–900 (2014).
26. Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol. Biol. Evol. 33, 1635–1638 (2016).
27. Summerfield, M. Rapid GUI Programming with Python and Qt: The Definitive Guide to PyQt Programming. Rapid GUI Programming with Python and Qt 54, (2008).
28. Dairi, T. Isoprenoid in Actinomycetes. in Reference Module in Chemistry, Molecular Sciences and Chemical Engineering 1–21 (2013). doi:10.1016/B978-0-12-409547-2.02738-4
29. Letunic, I. & Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44, W242–W245 (2016).

SUPPLEMENTARY MATERIALS

MAIN CHROMOSOME  PLASMIDS
CP000850 *
CP001700 *
FQ859185         FQ859184
CP003219         CP003229
CP011533
CP003987         CP003988
CP007574         CP007575
CP006871
CP019779
LT907981
CP020569
CP007699
CP017157
CP021744
CP019458         CP019459, CP019460, CP019461, CP019462, CP019463, CP019464, CP019465
CP023992         CP023993
CP022545         CP022546
CP018627
CP002994         CP002995, CP002996
CP002047
CP011492
CP002475         CP002476, CP002477
CP003990         CP003991
CP002993
CP013738         CP013739, CP013740
CP010833
AP009493
CP020570
CM003601         CM003602
CP015362
CP011522         CP011523
CP005080
CP011340
CM000950
CP018870
CP019798
CP020039         CP020040, CP020041
CP020555         CP020556
CP011664         CP011665, CP011666, CP011667
CP016559         CP016560
CM000913         CM000914
CM001015         CM001016, CM001017, CM001018, CM001019
CP018047
CM002280         CM002281, CM002282, CM002283, CM002284
CP015588
CP022685
CP021748
CP010519
CP016825
CP020042         CP020043, CP020044
FR845719
CP010407         CP010408
CP017248
LN881739
CP009438         CP009439
CP013129
CP003275         CP003276, CP003277
AP017424
CP013219         CP013220
BA000030         AP005645
CP003720         CP003721, CP003722
CM002273         CM002274, CM002275
LN997842         LN997843, LN997844, LN997845
CM002271         CM002272
CP022433
CP009802         CP009803, CP009804
CP021080
CP014485
CP006259         CP006260, CP006261
CP004370
CP016279
CP021118         CP021119
CP015849
CM001165         CM001166
CM000951
CP017316
LT670819
CM002285         CM002286
FN554889
CP013142
CP016438
CP009124
CP013743         CP013744
CM001889
CP022161
LT629768
CM007717
CP012949
HE971709         HE971710
CP012382         CP012383
CP015098
CP015866         CP015867
CP021121
CP019724
CP009922
CP009754
LT629775
CP016795
LN831790         LN831788, LN831789
CP010849

Table A: Accession numbers of all genomes, including their plasmids. An asterisk (*) marks an accession of an outgroup.


Figure A: Phylogenetic tree based on rpoB of the 97 Streptomyces species, with Salinispora and Catenulispora as outgroups. The tree was visualized in iTOL (29).

The Python scripts can be found at: https://github.com/aroeters/optimization_scripts.git
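
The conclusions note that the purity-based score favours small groups and could be improved by taking group sizes into account. A minimal sketch of what such a correction could look like is given below. It is not taken from the repository above: the function name, the input dictionaries and the definition of purity used here (the share of the most common curated group within a GCF) are assumptions made for illustration only.

```python
from collections import Counter


def purity_scores(families, curated_group):
    """Return (mean per-family purity, size-weighted purity).

    families: dict mapping a GCF id to the list of BGC ids it contains
              (hypothetical input format).
    curated_group: dict mapping each BGC id to its curated group label
                   (hypothetical input format).
    """
    if not families:
        raise ValueError("at least one family is required")

    per_family = []
    majority_total = 0
    member_total = 0
    for members in families.values():
        # Count how often each curated group occurs in this family.
        counts = Counter(curated_group[bgc] for bgc in members)
        majority = counts.most_common(1)[0][1]
        per_family.append(majority / len(members))
        majority_total += majority
        member_total += len(members)

    # Unweighted mean: every family counts equally, so singletons
    # (always 100% pure) pull the score up.
    mean_purity = sum(per_family) / len(per_family)
    # Size-weighted purity: every BGC counts equally instead.
    weighted_purity = majority_total / member_total
    return mean_purity, weighted_purity


# Toy data: two singleton families and one larger, mostly pure family.
families = {
    "GCF1": ["a"],
    "GCF2": ["b"],
    "GCF3": ["c", "d", "e", "f", "g", "h", "i", "j"],
}
curated_group = {"a": "X", "b": "Y", "i": "W", "j": "W"}
curated_group.update({bgc: "Z" for bgc in "cdefgh"})

mean_p, weighted_p = purity_scores(families, curated_group)
print(round(mean_p, 3), round(weighted_p, 3))
```

In this toy example the unweighted mean is about 0.917, while the size-weighted purity is 0.8, which illustrates how singleton families inflate a score that ignores group size.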
