Probing the Biosynthetic Diversity of Actinobacteria 29-01-2018
BSc. A. Roeters - Probing the biosynthetic diversity of actinobacteria 29-01-2018

PROBING THE BIOSYNTHETIC DIVERSITY OF ACTINOBACTERIA

MSc. Thesis by Arne Roeters, supervised by dr. MH Medema and JC Navarro Munoz PhD. Bioinformatics department, Wageningen University.

ABSTRACT

The Actinobacteria are a large phylum of Gram-positive bacteria from which we harvest many clinically useful natural products. A large portion of these products is made by the largest genus within this phylum, Streptomyces. These products are made by biosynthetic gene clusters (BGCs), which are physically clustered genes on the genome. To find more of these natural compounds, genome mining has become one of the most important tools in bioinformatics. This new technique has given rise to programs like antiSMASH (Medema, et al., 2011). Programs like this have created new challenges due to the large number of BGCs they mine. To narrow the search for new interesting natural products, BiG-SCAPE was created. This program clusters BGCs together to find novel BGCs that might be variants of already existing clinically useful products. However, this program can still be improved by optimizing the parameters it uses. The goal of this project is to optimize these parameters and use them to find interesting BGC patterns in the phylogeny of Streptomyces.

INTRODUCTION

Actinobacteria are a very large phylum of Gram-positive bacteria that are widespread across terrestrial and aquatic ecosystems. They are primarily chemoheterotrophs (i.e. they get their energy from ingesting organic compounds) that have varying tolerances of oxygen levels, from strictly aerobic to strictly anaerobic. In soil systems of forests and agricultural land they are of great importance since they help decompose rotten and dead material, so that it can be taken up by plants. Some species even have symbiotic relations with certain plants. In these relationships the actinobacteria provide nitrogen to the plant and in return they take some of the plant's saccharide reserves [1,2].

Maybe even more important and interesting about these bacteria are their secondary metabolites, which can be used for medical purposes [3]. A large part of the clinically available antibiotics comes from Actinobacteria, and especially from the largest genus, Streptomyces. This genus produces over two-thirds of the clinically useful natural antibiotics with its natural product biosynthetic gene clusters [4]. Not nearly all natural compounds have been found yet, meaning that there might still be many more useful compounds that are made by the biosynthetic pathways of Actinobacteria. These biosynthetic pathways consist of genes that are physically clustered together on the chromosome, forming so-called biosynthetic gene clusters (BGCs) [5-7]. These BGCs create a vast number of metabolites within the organism in which they are present.

There are two types of metabolites: the primary metabolites, which are not produced by BGCs, and the secondary metabolites, which are produced by BGCs. Primary metabolites are used by the organism to grow, develop, survive and reproduce. Secondary metabolites are not directly involved in these processes and are not necessary for the organism to survive, although a lack of these metabolites might be harmful for the organism in the long run, as they help it defend against other microbes [4]. Many of these secondary metabolites are used by humans for other purposes (e.g. caffeine is produced by some plants to protect them against bugs, since it is poisonous in high concentrations, and to prevent germination of other plants in the area).

These metabolites are synthesized in many ways and are therefore also classified differently. The classes that are used in this study are:

- Polyketides ("assembly line"-like synthesis with a low number of building blocks)
- Non-Ribosomal Peptides (NRPs, synthesized in almost the same way as polyketides but have non-proteinogenic amino acids as the building blocks) [8]
- Ribosomally synthesized and Post-translationally modified Peptides (RiPPs, contain a gene coding for a single core peptide and multiple other genes coding for modification enzymes)
- Terpenes (use isoprene as building blocks)
- Saccharides (part of the carbohydrate synthesis)
- Polyketide/NRP hybrids [9] (PKS-NRP)
- Other (all metabolites that cannot be classified in one of the other classes, e.g. alkaloids [10])

Within these classes, the metabolites are divided further into more specific groups. For instance, the class polyketide contains the groups antifungal polyenes, macrolides (14-membered), macrolides (16-membered), statins and many more. These groups divide the metabolites even further according to the way they are built (table 1).

Since the rise of rapid and cheap DNA sequencing techniques, many new BGCs have been found. Not all of these BGCs have been linked to the metabolites they create. This means that there might be many more clinically useful metabolites that await discovery [11]. Recently, genome mining of these BGCs became vital for finding new molecules, which led to numerous new compounds. To mine all these BGCs, multiple tools have been developed (e.g. ClustScan for NRPS and PKS [12], BAGEL3 [13] for RiPPs and antiSMASH [14] for all classes) to help scientists in this mission of finding molecules that could be used in medicine. The problem here is that there are so many novel BGCs that it becomes difficult to know where to start. This can be simplified by finding BGCs that show similarities to known BGCs of which the products are already clinically useful. In the end this might help in finding novel varieties of already existing drugs or novel interesting natural products to which other microorganisms are not yet resistant.

Class         Nr. of groups   Nr. of members
NRPS          24              131
Other         4               24
PKS-NRP       5               17
PKSI          36              142
PKSother      1               10
RiPPs         12              36
Saccharides   3               10
Terpenes      4               19

Table 1: Number of known groups and compounds in the test set. This shows that not all compounds that are in MIBiG have assigned groups.

BGC MINING AND CLUSTERING

In this research project antiSMASH is used to find gene clusters that result in secondary metabolites. antiSMASH predicts gene clusters based on the Pfam [15] domains that are present and are associated with biosynthetic genes. The newest version of antiSMASH will be used in the project with and without the newly incorporated ClusterFinder algorithm. The ClusterFinder algorithm is a BGC prediction pipeline which first annotates the genome sequence and converts it to a string of Pfam domains. It then calculates the probability of each domain being part of a BGC based on two conditions: the first is the frequency of the domain in the BGC training and non-BGC training sets, and the second is the identities of the neighboring domains. The last step is to cluster genes that contain Pfam domains where the probability of a BGC hidden state is above a certain threshold.

After antiSMASH, BiG-SCAPE [8] is used to define a distance matrix between the found domains. BiG-SCAPE was developed to create a distance matrix using three different components: the Jaccard index (JI), the domain similarity score (DSS) and the adjacency index (AI). Since a BGC is a collection of genes that together encode a set of proteins that will eventually synthesize the compound, the distances are based on the domains in the BGCs. The JI is the ratio between the number of distinct domains shared by a pair of BGCs and the total number of distinct domains in that pair. The DSS measures the sequence similarity between domains and is normalized for the difference in copy number of the domains. For the calculation of the DSS, the key domains (here called anchor domains) and less important domains (non-anchor domains) are taken into account separately. The anchor domains can be virtually multiplied by a factor called the anchor boost, which makes the anchor domains weigh more heavily in the calculation of the DSS. The AI is the ratio between the shared adjacent domain pairs and the unshared domain pairs, without taking the order of the domain pairs into account. The final distance that is used in the distance matrix is composed of the three similarity measures using formula 1.

$$ D = 1 - (a \cdot JI_{pq} + b \cdot DSS_{pq} + c \cdot AI_{pq}) $$

Formula 1: The distance calculation used by BiG-SCAPE. Here D is the final distance between BGCs p and q, and a, b and c are the weights that are given to the JI, DSS and AI respectively, where a + b + c = 1.

Based on this distance matrix, BiG-SCAPE creates a network of all BGCs which contains all Gene Cluster Families (GCFs). These GCFs contain all BGCs that have a similar structure in Pfam domains, which would result in a similar metabolite. These networks are created by BiG-SCAPE, which uses a clustering algorithm called affinity propagation [16] (AP) that, unlike k-means clustering, does not need the number of clusters to be predefined. This algorithm uses a concept called message passing to calculate the final clusters. These messages are sent around between neighboring data points until the clusters are resolved and all data points either are, or are linked to, so-called exemplars (a representative for the data point itself and the data points that are linked to it).

... created by BiG-SCAPE using all entries in the database as input. Further options used were "--hybrids", to also obtain networks from hybrid clusters like PKS-NRP, "--mix", to get network files that have all classes together, and "--mode lcs", to only use the longest slice mode, which is a glocal alignment between two BGCs. In this alignment mode, the longest common substring will be extended to both sides with matches and mismatches until the BGCs don't overlap anymore. A smaller subset of the domains in the BGCs, will
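As an illustration of the distance calculation in formula 1, the following Python sketch computes a standard Jaccard index from two sets of domains and combines it with a DSS and an AI value. It is a minimal sketch under stated assumptions, not BiG-SCAPE's implementation: the function names, the example domain lists, the DSS and AI values and the weights a, b and c are all hypothetical and chosen only to show the arithmetic.

```python
from math import isclose


def jaccard_index(domains_p, domains_q):
    """Standard Jaccard index: shared distinct domains / all distinct domains."""
    set_p, set_q = set(domains_p), set(domains_q)
    union = set_p | set_q
    if not union:
        # Two empty domain lists share nothing; treat the similarity as 0.
        return 0.0
    return len(set_p & set_q) / len(union)


def bgc_distance(ji, dss, ai, a, b, c):
    """Formula 1: D = 1 - (a*JI + b*DSS + c*AI), with a + b + c = 1."""
    if not isclose(a + b + c, 1.0):
        raise ValueError("weights a, b and c must sum to 1")
    return 1.0 - (a * ji + b * dss + c * ai)


# Hypothetical Pfam domain lists for two BGCs p and q (illustrative only).
bgc_p = ["PF00109", "PF02801", "PF00698", "PF00550"]
bgc_q = ["PF00109", "PF02801", "PF00550", "PF00975"]

# 3 shared distinct domains out of 5 distinct domains in total.
ji = jaccard_index(bgc_p, bgc_q)

# DSS and AI are taken as given here; these values and weights are made up.
distance = bgc_distance(ji, dss=0.8, ai=0.5, a=0.2, b=0.75, c=0.05)
print(round(distance, 3))
```

Because the weights sum to 1 and each component lies between 0 and 1, the resulting distance also lies between 0 (identical under all three measures) and 1 (nothing in common).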