<<

In silico studies of modular biosynthetic clusters

Dem Fachbereich Biologie der Universität Kaiserslautern zur Erlangung des akademischen Grades “Doktor der Naturwissenschaften” eingerichte

DISSERTATION

Vorlegt von Antonio Starčević

Vorsitzende: Prof. Dr. Matthias Hahn Erster Berichterstatter: Prof. Dr. John Cullum Zweiter Berichterstatter: Prof. Dr. Regine Hakenbeck

Lehrbereich Genetik der Universität Kaiserslautern, Januar 2009

D386 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

PUBLISHED SCIENTIFIC PAPERS

The thesis is a cumulative thesis based on the following four published papers. The papers bound in the thesis are identical to the published versions.

1. Starcevic, A., J. Cullum, M. Jaspars, D. Hranueli & P.F. Long. Predicting the nature and timing of epimerisation on a modular polyketide synthase. ChemBioChem, 8, 28–31, 2007.

2. Starcevic, A., J. Zucko, J. Simunkovic, P.F. Long, J. Cullum & D. Hranueli. ClustScan: An integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res., 36, 6882-6892, 2008.

3. Milne, B.F., P.F. Long, A. Starcevic, D. Hranueli & M. Jaspars. Spontaneity in the patellamide biosynthetic pathway. Org. Biomol. Chem., 4, 631-638, 2006.

4. Starcevic, A., S. Akthar, W.C. Dunlap, J.M. Shick, D. Hranueli, J. Cullum & P.F. Long. of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins. Proc. Natl. Acad. Sci. USA, 105, 2533-2537, 2008.

2 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

TABLE OF CONTENTS

1 INTRODUCTION 4

1.1 Literature overview 5

1.2 Goals and work objectives 13

2 SCIENTIFIC PAPERS 15

2.1 Predicting the nature and timing of epimerisation on a modular Polyketide synthase 15

2.2 ClustScan: An integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures 21

2.3 Spontaneity in the patellamide biosynthetic pathway 33

2.4 Enzymes of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins 42

3 DISCUSSION 48

3.1 The role of ketoreductase domain in determining methyl stereochemistry 52 3.2 Predicting linear products of modular polyketide gene clusters 53 3.3 Microbial origins of shikimic acid pathway enzymes encoded in the genome of Nematostella vectensis 57 3.4 Carsonella rudii and Buchnera aphidicola – another case of horizontal gene transfer 58

4 OUTLOOK FOR FUTURE WORK 60

5 ABSTRACT 61

6 REFERENCES 62

ACKNOWLEDGMENTS 67

APPENDICES 68

Curriculum Vitae 69

3 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

1 INTRODUCTION

Many important secondary metabolites in are synthesised by modular biosynthetic clusters: polyketide synthase (PKS), non-ribosomal peptide synthetase (NRPS), NRPS- independent siderophore (NIS) synthetase and hybrid clusters. These include polyketide antibiotics (e.g. erythromycin), immuno-suppressants (e.g. rapamycin) and antiparasitics (e.g. avermectin) as well as peptide antibiotics (e.g. penicillin), immuno-suppressants (e.g. cyclosporin) and herbicides (e.g. bialaphos) (Weissman and Leadlay, 2005; Finking and Marahiel, 2004; Hranueli et al., 2005; Challis, 2005).

This year marks the eightieth anniversary of the discovery of penicillin (Fleming, 1929) (Fig. 1). Since then pharmaceutical companies and academic institutions have screened cultivable microorganisms from different habitats for their natural products.

Fig. 1. Commercial secondary metabolites discovered.

Tens of thousands of biologically active polyketide and non-ribosomal peptides with important pharmacological properties have been discovered, less than 100 of which have found their way to the market. Therefore, only 1 to 3 biologically active chemical entities per 10.000

4 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

described have found clinical use. However, during the last 40 years, it has become ever more difficult to isolate cultivable microorganisms with new biologically interesting activities and the rate of discovery of novel drugs has dropped considerably. The major question now is: are there any alternative strategies to exploit the chemical potential of small molecule natural products?

Many pharmaceutical companies abandoned natural products and turned to high-throughput screening of chemical libraries for potential new targets (Feher and Schmidt, 2003). However, today we know that this is not a productive approach. Maybe a modified „return to nature“ paradigm can still give us results we need. This will require extensive „in silico“ or bioinformatic components. There are numerous possibilities, but they have a common strategy that the search for novel compounds is driven from the DNA sequence. A simple first step is to filter out known biosynthetic pathways. For novel pathways it should be possible to make partial prediction of compound structures that will help select the most interesting synthesis clusters. Similar methods can be used for in silico recombination between clusters to look for novelty. The analysis can also underpin the laboratory work by suggesting methods to express desired pathways and designing purification schemes for their products.

1.1 Literature overview

Large scale DNA sequencing has revealed many modular gene-clusters, whose products are unknown (Bentley et al., 2002; Ikeda et al., 2003; Oliynyk et al., 2007; Ohnishi et al., 2008; Hranueli et al., 2009). The genome sequencing of bacteria is expanding rapidly with more than 1.000 sequenced bacterial genomes and more than fourteen hundred additional genome sequencing projects in progress (Fig. 2). Advances in sequencing technology ("next generation" sequencing) will lead to an explosive increase in known genome sequences in the near future (Shaffer, 2007). Modular polyketide and non-ribosomal peptide gene clusters, which can collectively be called Thiotemplate Modular Systems (TMS), have been analyzed within about 250 bacterial genomes. It was shown that the bacteria whose genomes are smaller than 4 Mb have only a few or no TMS gene clusters. The number of clusters grows proportionally with the size of the bacterial genome (Donadio et al., 2007). Actinobacteria are a particularly prolific

5 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

source and each genome sequenced to date encodes more than one gene cluster for the synthesis of such secondary metabolites.

Fig. 2. Current status of bacterial genome sequencing projects at NCBI (January, 2009).

The National Center for Biotechnology Information (NCBI) Genome database contains 55 sequenced Actinobacteria genomes with more than 110 additional genome sequencing projects in progress (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html, accessed in December 2008). One can expect that with the development of “next generation” sequencing technology more than 1.000 Actinobacteria genomes will be sequenced in the next year or two. Under the conservative assumption that most of sequenced Actinobacteria genome contains about 10 TMS gene clusters for unknown biosynthetic products, one can expect more than 10.000 new clusters most of them producing as yet undescribed chemical entities. The majority of these gene clusters are probably actively expressed. They have typical TMS gene, module and domain organisation with no detectable rearrangements or mutations that would result in pseudogenes. There is, therefore, significant untapped chemical diversity in these sequenced genomes and gene clusters.

6 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Table 1. Current status of Actinobacteria genome sequencing projects. The number of sequenced (A) and sequencing in progress (B) genomes is showed (December, 2008; source of information NCBI). A

Organism Group GC% Accession Center Acidothermus cellulolyticus 11B Actinobacteria 66.9 CP000481.1 DOE Joint Genome Institute Arthrobacter aurescens TC1 Actinobacteria 62.4 CP000474.1 TIGR [more] Arthrobacter sp. FB24 Actinobacteria 65.4 CP000454.1 DOE Joint Genome Institute Bifidobacterium adolescentis ATCC 15703 Actinobacteria 59.2 AP009256.1 Gifu University, Life Science Research Center, Japan Bifidobacterium longum DJO10A Actinobacteria 60.2 CP000605.1 University of Minnesota Bifidobacterium longum NCC2705 Actinobacteria 60.1 AE014295.3 Nestle Research Center, Switzerland Bifidobacterium longum subsp. infantis ATCC 15697 Actinobacteria 59.9 CP001095.1 DOE Joint Genome Institute Clavibacter michiganensis subsp. michiganensis NCPPB 382 Actinobacteria 72.5 AM711867.1 Bielefeld Univ Clavibacter michiganensis subsp. sepedonicus Actinobacteria 72.4 AM849034.1 Welcome Trust Sanger Institute Corynebacterium diphtheriae NCTC 13129 Actinobacteria 53.5 BX248353.1 Welcome Trust Sanger Institute Corynebacterium efficiens YS‐314 Actinobacteria 63.1 BA000035.2 NITE Corynebacterium glutamicum ATCC 13032 Actinobacteria 53.8 BA000036.3 Kitasato Univ. Corynebacterium glutamicum ATCC 13032 Actinobacteria 53.8 BX927147.1 Universitaet Bielefeld Corynebacterium glutamicum R Actinobacteria 54.1 AP009044.1 Research Institute of Innovative Technology for the Earth (RITE) Corynebacterium jeikeium K411 Actinobacteria 61.4 CR931997.1 Bielefeld Univ Corynebacterium urealyticum DSM 7109 Actinobacteria 64.2 AM942444.1 Bielefeld Univ Frankia alni ACN14a Actinobacteria 72.8 CT573213.2 Genoscope Frankia sp. CcI3 Actinobacteria 70.1 CP000249.1 DOE Joint Genome Institute Frankia sp. EAN1pec Actinobacteria 71.2 CP000820.1 DOE Joint Genome Institute Kineococcus radiotolerans SRS30216 Actinobacteria 74.2 CP000750.2 DOE Joint Genome Institute Kocuria rhizophila DC2201 Actinobacteria 71.2 AP009152.1 National Institute of Technology and Evaluation Leifsonia xyli subsp. xyli str. CTCB07 Actinobacteria 67.7 AE016822.1 Sao Paulo state (Brazil) Consortium Mycobacterium abscessus Actinobacteria 64.1 CU458896.1 Genoscope Mycobacterium avium 104 Actinobacteria 69.0 CP000479.1 TIGR Mycobacterium avium subsp. paratuberculosis K‐10 Actinobacteria 69.3 AE016958.1 University of Minnesota Mycobacterium bovis AF2122/97 Actinobacteria 65.6 BX248333.1 Welcome Trust Sanger Institute Mycobacterium bovis BCG str. Pasteur 1173P2 Actinobacteria 65.6 AM408590.1 Institut Pasteur Mycobacterium gilvum PYR‐GCK Actinobacteria 67.7 CP000656.1 DOE Joint Genome Institute Mycobacterium leprae TN Actinobacteria 57.8 AL450380.1 Welcome Trust Sanger Institute Mycobacterium marinum M Actinobacteria 65.7 CP000854.1 Welcome Trust Sanger Institute Mycobacterium smegmatis str. MC2 155 Actinobacteria 67.4 CP000480.1 TIGR Mycobacterium sp. JLS Actinobacteria 68.4 CP000580.1 DOE Joint Genome Institute Mycobacterium sp. KMS Actinobacteria 68.2 CP000518.1 DOE Joint Genome Institute Mycobacterium sp. MCS Actinobacteria 68.4 CP000384.1 DOE Joint Genome Institute Mycobacterium tuberculosis CDC1551 Actinobacteria 65.6 AE000516.2 TIGR Mycobacterium tuberculosis F11 Actinobacteria 65.6 CP000717.1 Broad Institute Mycobacterium tuberculosis H37Ra Actinobacteria 65.6 CP000611.1 Chinese National HGC, Shanghai Mycobacterium tuberculosis H37Rv Actinobacteria 65.6 AL123456.2 Welcome Trust Sanger Institute Mycobacterium ulcerans Agy99 Actinobacteria 65.5 CP000325.1 Institut Pasteur Mycobacterium vanbaalenii PYR‐1 Actinobacteria 67.8 CP000511.1 DOE Joint Genome Institute Nocardia farcinica IFM 10152 Actinobacteria 70.7 AP006618.1 Department of Bioactive Molecules, National Institute of Infectious Diseases Nocardioides sp. JS614 Actinobacteria 71.4 CP000509.1 DOE Joint Genome Institute Propionibacterium acnes KPA171202 Actinobacteria 60.0 AE017283.1 University of Goettingen Renibacterium salmoninarum ATCC 33209 Actinobacteria 56.3 CP000910.1 Northwest Fisheries Science Center, National Marine Fisheries Service, NOAA Rhodococcus sp. RHA1 Actinobacteria 67.0 CP000431.1 University of British Columbia, Canada Rubrobacter xylanophilus DSM 9941 Actinobacteria 70.5 CP000386.1 DOE Joint Genome Institute Saccharopolyspora erythraea NRRL 2338 Actinobacteria 71.1 AM420293.1 Department of Biochemistry, University of Cambridge Salinispora arenicola CNS‐205 Actinobacteria 69.5 CP000850.1 DOE Joint Genome Institute Salinispora tropica CNB‐440 Actinobacteria 69.5 CP000667.1 DOE Joint Genome Institute Streptomyces avermitilis MA‐4680 Actinobacteria 70.7 BA000030.3 NITE Streptomyces coelicolor A3(2) Actinobacteria 72.0 AL645882.2 Welcome Trust Sanger Institute Streptomyces griseus subsp. griseus NBRC 13350 Actinobacteria 72.2 AP009493.1 University of Tokyo, Japan Thermobifida fusca YX Actinobacteria 67.5 CP000088.1 DOE Joint Genome Institute Tropheryma whipplei str. Twist Actinobacteria 46.3 AE014184.1 Genoscope Tropheryma whipplei TW08/27 Actinobacteria 46.3 BX072543.1 Welcome Trust Sanger Institute

7 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

B

8 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Moreover, recent developments in DNA sequencing technology have opened up new opportunities allowing the DNA sequencing of uncultivable microorganisms. This field of study, called metagenomics, allows access to the genetic potential of both cultivable and uncultivable microorganisms in a particular environment (Dunlap et al., 2006). In 2004 the Craig Venter's Institute launched the Sorcerer II global ocean sampling expedition to evaluate the microbial diversity in the world's oceans using molecular tools and techniques originally developed to sequence the human genome. Their data are the largest metagenomic dataset ever put into the public domain with more than 7.7 million sequencing reads comprising 6.3 billion base pairs of DNA (Rusch et al., 2007). There are several hundred sequenced modular gene-clusters in public databases that have not been adequately analysed and their number will grow exponentially with the new bacterial genome and metagenome sequencing projects.

Traditional screening approaches will miss many interesting compounds, because they are not produced under the conditions used. Recently, there was a mathematical model published, showing that there is a potential of nearly 100.000 secondary metabolites in Streptomyces strains alone (Watve et al., 2001). That means we know less than 10% of the total number available. Additionally, purification of an unknown compound is costly and time- intensive. Gene-clusters encoding secondary metabolic biosynthetic pathways can be recognised by bioinformatic „in silico“ methods which allow partial prediction of the nature of the compounds produced (Donadio et al., 2007). Computational tools are now needed to rapidly mine such massive datasets to hunt for gene-clusters encoding entirely new compounds which could be the medicines of the future.

The biosynthesis of modular polyketides occurs by a series of sequential steps and each step is encoded by a group of catalytic domains grouped in different modules. The modules are usually arranged across several polypeptides, with often more than one module in a polypeptide. In the case of PKS modules. The acyl transferase domain (AT) couples the new extender unit to the acyl carrier protein (ACP) domain. The ketosynthetase domain (KS) then transfers the growing chain to the new extender unit and performs the condensation reaction with elimination of CO2. The basic (“essential”) module thus needs only KS-AT-ACP and will build a keto group into the chain. Many modules contain further reduction domains. If a ketoreductase (KR) domain

9 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

is present, the keto group will be reduced to a hydroxyl group. If there is also a dehydratase domain (DH) the hydroxyl group will be reduced to a double bond and if there is also an enoyl reductase domain (ER), the keto-group will be reduced completely. (Sattely et al., 2008) This modularity implies that if such systems could be programmed at will, there would be a large number of possible compounds (e.g. if there were only 10 modules there would be at least 109 compounds possible).

It also implies that it should be possible to predict the biosynthetic activity of a module from the DNA sequence that encodes it. The choice of extender unit seems to be governed by the AT domain and differences between domains that choose malonyl-CoA, methoxymalonyl-CoA, methylmalonyl-CoA and ethylmalonyl-CoA as substrates have been recognised. In addition to AT domains in extender modules, there are often AT domains in starter modules which are somewhat confusingly called "loading domains". A set of AT starter domains that incorporate malonyl, propionyl and methylbutyryl starters have also been recognised. There are also at least 10 known unusual loading domains where AT domains are not present. Nevertheless, amino acid sequences in loading domains responsible for the choice of unusual starters are also recognised (Sattely et al., 2008). It is also possible to recognise the reduction domains (KR, DH and ER) by sequence similarity and, thus, predict the degree of the reduction of α- carbon and the stereochemistry of both α- and β-carbons. This is complicated by the fact that sometimes non- active reduction domains are present. The KR domain is the best characterised domain in terms of structural determination of differential activities. Active KR domains determine the chirality of the hydroxyl product and bioinformatic analysis identified amino acid residues involved in this choice (Caffrey, 2003). Bioinformatics also suggested that KR rather than KS determined the stereochemistry of β-carbon groups, when C3 or C4 units are used as substrates (work in this thesis, Starcevic et al., 2007). A comparison of 3-D structures of two KR domains of different specificity gave more detailed information on amino acid residues involved in determining both hydroxyl and β-carbon stereochemistry (Keatinge-Clay, 2007). There are six possible products (A1, A2, B1, B2, C1, C2), which correspond to three possible ketoreduction outcomes (two hydroxyl stereoisomers - A or B, or no reduction C) coupled with two β-carbon chiralities (called 1 and 2). For DH and ER domains, the prediction should firstly distinguish active and inactive domains. It is also believed that the DH domains determine the cis/trans orientation. However,

10 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

insufficient structural information is available to make predictions based on knowledge of function alone. After complete reduction of the keto-group the ER domain determines the stereochemistry of the side chain on the β-carbons (Kwan et al., 2008).

Modular NRPS and NIS gene-clusters have a similar organisation to modular PKS gene- clusters, but the modules assemble amino acids (Finking & Marahiel, 2004; Challis, 2005). The modules are composed of catalytic domains with standard (condensation, adenylation and peptidyl carrier protein; C-A-PCP) domains present in all modules as well as optional domains (e.g. epimerization (Epi), cylization (Cyc), N-methylation (N-Meth), domains). In addition to the 20 amino acids present in most proteins, NRPSs may incorporate other amino acids (more than 400 are known to date). There is, thus, much potential for combinatorial biosynthesis.

Several computational tools for exploration of modular gene clusters have been developed. A tool that helps the analysis of new gene-clusters is the NRPS-PKS database (SEARCHPKS; Yadav et. al, 2003; http://www.nii.res.in/ nrps-pks.html), which holds data on PKS and NRPS gene-clusters including module and domain structure and chemical structures of the gene-cluster products. It allows users to input protein sequences to be used in BLAST (Altschul et al., 1990) searches to identify domains and find the closest sequences in the database. This allows prediction of whether an AT-domain uses malonyl-CoA or methylmalonyl-CoA as a substrate (i.e. whether a C2 or C3 unit is incorporated into the polyketide). Within the ASMPKS database (Tae et al., 2007; http://gate.smallsoft.co.kr:8008/ %7Ehstae/asmpks/index.html) there is another tool: MAPSI that uses a similar analysis methodology, but integrates it with a graphical display that shows the presence of domains in genes so that modules can be easily recognised. It also allows the display of a predicted linear polyketide chain product for which the user has to select starter and extender units from the list. However, the stereochemistry of the product is not considered. The company ECOPIA has also developed a software tool DecipherITTM (Zazopoulos et al., 2003), which helps annotation of new gene-clusters based on comparison with a database of known gene-clusters. A slightly different tool is the Biogenerator program (Zotchev et al., 2006), which allows the construction of new polyketides based on known modules in silico. The programs listed above all depend on comparison with sequences in a database to analyse new gene-clusters. This works best for gene-clusters closely related to well-

11 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

characterised gene-clusters. However, almost all of them are using predictions based on similarity searches and can be considered rudimentary.

In the last few years there has also been considerable published work to identify protein features responsible for specificity of domains. Amino acid residues in AT domains that differ between malonyl-CoA-incorporating and methylmalonyl-CoA-incorporating domains have been identified from multiple alignments of AT sequences (Yadav et al., 2003). Specific residues in the ketoreductase (KR) domain that seem to be responsible for the stereochemistry of hydroxyl groups and of the β-carbon atom in the extender unit have also been identified (Caffrey, 2003; Starcevic et al., 2007; Keatinge-Clay, 2007; Castonguay et al., 2007). A complication in the analysis of modules is that inactive reduction domains (i.e. KR, DH, ER) may be present. In some cases, inactivation is associated with substantial deletion (i.e. a much shorter domain), but in other cases only a small number of critical residues may be changed.

So far major bioinformatic efforts regarding specificity of domains were based on similarity approach using BLAST (Altschul et al., 1990). Unfortunately BLAST suffers from the problems that the weighting of each residue is independent of position and that insertion or deletion of residues is not handled well. In contrast, the hidden Markov model (HMM) approach used in the HMMER suite of programs (Eddy, 1998), deals with each amino acid position separately and, thus, distinguishes between critical highly conserved residues and less important positions. In addition, the HMMs consider insertion and deletion states and assign them probabilities. The hmmalign program of the HMMER suite aligns sequences with respect to the HMM profile allowing accurate identification of particular residue positions. A similar HMM approach was used to predict substrate specificity of NRPS adenylation domains, which has been incorporated into a program package called the NRPSpredictor (Rausch et al., 2005; http://www.ab. informatik.uni-tuebingen.de/software/NRPSpredictor/welcome.html).

12 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

1.2 Goals and work objectives

Recently, the marine environment emerged as abundant source of novel unexplored natural products. Organisms like marine actinomycetes, dinoflagellates, seasquirts and corals are being screened and tested for presence of secondary metabolites. Such gene clusters would probably only show low similarity to known clusters. This imposes serious problems for the position- independent pairwise comparisons used in BLAST (Altschul et al., 1990) or even the more sensitive Smith-Waterman algorithm (Smith and Waterman, 1981).

At the beginning of our bioinformatic work, this was the main obstacle we had to overcome since we had to deal with marine organisms. We needed to find other ways of tackling distant homology recognition beside pairwise comparison. So, we had set up major goals and work objectives as follows:

If we start with the premise that we know what we are looking for in a genome, metagenome or any kind of contiguous sequence, then we might begin by collecting all possible sequences of genes we are interested in and build an HMM profile out of them. We expected profiles to show greater sensitivity in picking up remote homologies as long as we built them from large enough data samples. Also, once we had recognized genes and domains in sequence data, how shall we accomplish our ultimate goal of identifying a complete gene cluster, in a sequence which could be of genomic size? For that purpose, we had to develop a computer tool which integrated known algorithms, incorporated our expertise and ultimately allowed us to manage this knowledge efficiently. So we needed some kind of DNA sequence Knowledge Management. This Knowledge Management system should be:

• robust enough to handle long DNA sequences • sensitive - in order to pick up distant homologies • adaptable - to handle different kind of requests (e.g. searching for PKS, NRPS, NIS or any other gene cluster) • pliable - so that new knowledge is easily integrated on the run

13 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

• and last, but not the least – intuitive and user friendly (e.g. graphical user interface as well as display of search results)

Since most modular biosynthetic clusters are highly ordered in a way that they choose only the correct building blocks and always perform the same types of modifications on them, the resulting products are almost always uniform. This made us believe that we can also incorporate a model for automated domain specificity and stereochemistry prediction based on sequence analysis. To be able to test this hypothesis, a core database consisting of several well annotated modular biosynthetic clusters needed to be built. This database should hold all the necessary data (e.g. domain, module and gene sequences, building blocks and their stereochemistry etc.) so that computer predictions could be reasonably validated against actual data.

For this purpose we have chosen PKS type I clusters. The major reason being their remarkable modularity, but also the fact that there is much more information on them in comparison to other types of modular clusters. We did however need to solve a puzzle of stereochemistry of methyl groups and of the β-carbon atom in the extender unit in order for our predictions to be more complete.

The computer tool once developed should also work on organisms other then actinomycetes, so far the most abundant source of natural products. It should help us recognise modular gene clusters in marine genomes, since there is an obvious trend in Microbiology and natural products in general towards marine environments as a source of new drugs. The ability to rapidly and uniformly annotate clusters, modules and domains is an important prerequisite for constructing databases for the investigation of fundamental questions such as the evolution of PKS clusters.

14 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

2 SCIENTIFIC PAPERS

2.1 Predicting the nature and timing of epimerisation on a modular polyketide synthase

Antonio Starcevic, John Cullum, Marcel Jaspars, Daslav Hranueli and Paul F. Long. ChemBioChem, 8, 28–31, 2007.

Abstract:

Modular polyketide synthases (PKSs) are multifunctional complexes that catalyse the synthesis of a structurally diverse group of secondary metabolites – the polyketides. Many polyketides are produced by Streptomyces and related filamentous bacteria; some are of commercial significance including the antibiotics erythromycin A and rifamycin, the immunosuppressive macrolide rapamycin and the antitumour compound mithramycin. Since many of these compounds are high value biopharmaceuticals, there has been intense interest in studying the chemistry and molecular biology of biosynthesis on these megasynthases. The modular PKS leading to the formation of the erythromycin polyketide backbone (6- deoxyerythronolide B synthase or DEBS) in Saccharopolyspora erythraea is one of the best studied, especially regarding stereochemistry. Polyketide biosynthesis proceeds via a Claisen condensation between simple carboxylic acids with inversion of configuration to give a (2R) methyl centre. However, the methyl stereochemistry present in the erythromycin polyketide backbone, namely (2R), (4R), (6S), (8R), (10R) and (12S), implies that two epimerisation steps occur. We used a bioinformatics approach to identify a putative epimerase function in the relevant ketoreduction (KR) domains. For profile analysis the current version of HMMER and current release of the Pfam database were used. All three DEBS protein sequences were extracted from GenBank and searched against the entire Pfam database locally. The Secondary Structure Matching server at EBI was used as a tool for overlapping known crystal structures. Multiple alignments of A- and B-type KR sequences were constructed using ClustalW. Family profiling was first undertaken to predict the most likely location for epimerisation within the domains of the DEBS proteins. Sequence comparisons were then used to identify particular amino acids that could be involved in this catalysis. Family profiling within DEBS confirmed the sequence positions of all the known catalytic domains. Interestingly, however, when the NAD dependent epimerase/dehydratase family within Pfam was used as a search template, putative

15 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

epimerase motifs were identified within the boundaries of all the KR domains. Therefore, we postulate that C2 epimerization and C3 ketoreduction are coordinated processes and that the KR domains within modular polyketide synthases should be referred to as isomeroreductases to reflect their bifunctional nature.

Own contribution to the paper:

It was generally assumed that control of methyl stereochemistry was encoded by the KS domain. However, no differences could be detected. Therefore, it was decided to look for possible epimerase signals elsewhere. This lead to detection of possible similarities in KR domains and the novel proposal that the methyl stereochemistry was encoded by the KR domain. In order to publish this proposal it was necessary to also propose a chemical mechanism. Discussions with Prof. Jaspars and 3-D modelling of KR by threading sequences on a known related crystal structure resulted in a plausible chemical mechanism for the observed epimerase signals.

16 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

DOI: 10.1002/cbic.200600399 (TE) domain are housed within three polypeptides that collec- tively catalyse the condensation and appropriate reduction of Predicting the Nature and Timing of one propionyl-CoA starter unit and six (2S)-methylmalonyl-CoA extender units. Modules 1, 2, 5 and 6 contain a KR domain; Epimerisation on a Modular module 4 contains a complete set of reductive domains— Polyketide Synthase namely KR, DH and ER; while module 3 has no reductive func- tionality. Polyketide biosynthesis proceeds by a Claisen con- Antonio Starcevic,[b] Marcel Jaspars,[c] John Cullum,[d] densation with inversion of configuration to give a 2R methyl Daslav Hranueli,[b] and Paul F. Long*[a] centre. However, in module 1, this decarboxylative condensa- tion involves cleavage of the CH bond adjacent to the methyl The modular polyketide synthase leading to the formation of group at C2 of the extender unit to produce a 2S methyl the erythromycin aglycone 6-deoxyerythronolide B (6dEB) in centre. This implies that there must be an epimerisation step Saccharopolyspora erythraea is one of the best studied, espe- taking place, which would account for the combinations of cially regarding stereochemistry. Biosynthesis proceeds by methyl stereochemistry present in 6-dEB, namely 2R,4R,6S,8R, Claisen condensation with inversion of configuration to give a 10R,12S.[4] It is possible that proton exchange between the b- 2R methyl centre. However, the methyl stereochemistry of 6- ketoester and water could promote epimerisation, with down- dEB implies that two epimerisation steps occur. We have taken stream domains then displaying absolute specificity for only a a bioinformatics approach to explore the mechanisms for this given isomer;[5,6] but how this reaction proceeds remains an bifunctionality.[1,2] enigma, since discrete epimerisation domains have never been Polyketides are assembled through serial condensations of discovered in these modular systems. It would be a very attrac- activated coenzyme A thioester monomers derived from tive proposition to discover cryptic epimerisation activities simple organic acids such as acetate, propionate and butyrate. housed within already described domains. Two types of KR do- The activities required for each cycle of condensation and mains, termed A and B types, have been described.[7] Recently, processing of the b-keto group are contained within a module, the crystal structure of the KR domain from module 1 of 6-dEB and a discrete domain represents each activity. The active sites synthase (DEBS) was reported at 1.79 resolution.[8] This pro- required for condensation include an acyltransferase (AT), an tein folds to adopt two domains, one serving a purely structur- acyl carrier protein (ACP) and a b-ketoacylsynthase (KS). Each al role by stabilising the other domain, which catalyses reduc- condensation results in a b-keto group that undergoes all, tion of the C3 carbonyl of the diketide. Both contain identical some or none of a series of processing steps. Active sites that active-site regions buried within the protein core, but differ ac- perform these reactions are contained within the ketoreduc- cording to the presence of a D residue on one side of the tase (KR), dehydratase (DH) and enoylreductase (ER) domains. active-site groove that is thought to direct the substrate thio- The absence of any b-keto processing results in the incorpora- ester carbonyl towards the amino group of a conserved Y, cata- tion of a ketone group into the growing polyketide chain, a KR lysing an S reduction by the B forms. In the A forms, this D res- alone gives rise to a hydroxyl moiety, a KR and a DH produce idue is absent or replaced by W so that the direction of hy- an alkene, while the combination of KR, DH and ER domains dride addition is to the Si rather than Re face of the keto lead to complete reduction to an alkane. After assembly, the group, thus giving rise to an R chirality. The authors also sug- polyketide chain typically undergoes cyclisation and post-PKS gest that a Y residue, conserved in all KR domains, is flanked modifications to give the final active compound.[3] by amino acids in epimerising modules allowing flexibility to The enzymic steps leading to the formation of 6-dEB are catalyse epimerisation of the methyl stereocentre at C2. This Y ACHTUNGTRENNUNGencoded within six distinct modules. Thesecan act modules, as a base, together extracting a proton from the diketide sub- with a loading domain and a chain-terminating thioesterase strate to form an enolate. Proton addition to the enolate oxygen releases the diketide from the KR, but this epimerised diketide can undergo a tautomeric shift back to the keto form, [a] Dr. P. F. Long The School of Pharmacy, University of London presumably acting as an additional substrate for the KR. We 29/39 Brunswick Square, London WC1N 1AX (UK) believe that this mechanism would stall further polyketide Fax: (+ 44)207-753-5868 chain synthesis, which seems unlikely given that the overall E-mail: [email protected] Km/Kcat for 6-dEB synthesis is relatively rapid for a multifunction- [b] A. Starcevic, Prof. Dr. D. Hranueli al enzyme.[9] Here we take a bioinformatics approach to ex- Section for Bioinformatics, Department of Biochemical Engineering Faculty of Food Technology and Biotechnology, University of Zagreb plore an alternative mechanism that KR domains contain a bi- Pierottijeva 6, 10000 Zagreb (Croatia) functional active site that can fix both the methyl and alcohol [c] Prof. M. Jaspars stereochemistry of b-ketoacyl-ACP intermediates (Figure 1). Department of Chemistry, University of Aberdeen Family profiling was first undertaken to predict the most Old Aberdeen AB24 3UE (UK) likely location for epimerisation within the domains of the [d] Prof. Dr. J. Cullum DEBS proteins. Sequence comparisons were then used to iden- Department of Genetics, University of Kaiserslautern Postfach 3049, 67653 Kaiserslautern (Germany) tify particular amino acids that could be involved in this cataly- Supporting information for this article is available on the WWW under sis. Family profiling within DEBS confirmed the sequence posi- http://www.chembiochem.org or from the author. tions of all the known catalytic domains. Interestingly, however,

28 2007 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim ChemBioChem 2007,8,28–31

17 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Figure 1. Geometry and functionality of active-site residues from the KR domain of DEBS module 1, possibly controlling the methyl and alcohol ster- eochemistries of b-ketoacyl-ACP intermediates.

when the NAD-dependent epimerase/dehydratase family within Pfam was used as a search template, putative epimerase motifs were identified within the boundaries of all the KR do- mains. The KR domains are members of the “short-chain” dehy- [10] Scheme 1. The possible simultaneous stereochemical outcomes of NADH drogenase family and containing a characteristic Y–L couple. ACHTUNGTRENNUNGreductionZ ofand a diketideE geometry. enolate Hydride in the delivery to Overlapping the crystal structure for the KR domain from the Re or Si face is proposed to result in syn addition in which the Z and E module 1 of DEBS with the structure of human UDP-galactose- enolates give syn and anti diastereomers. Alternatively, anti addition to the Z 4-epimerase[11] in complex with NAD+ identified matching resi- enolate would also give rise to the anti products upon hydride delivery. dues at the same positions in both enzymes. We noted that these residues, N, S, Y and K, were also conserved in an amino acid sequence alignment of all KR domains (see the Support- ing Information). Based on these alignment comparisons and Pfam results, we speculate that KR domains are the most likely source of the epimerase activity observed in modular PKSs. In- terestingly, the same conserved residues were not only identi- fied but also anticipated by Reid et al.,[10] to play a critical role in reduction. We speculate that reduction and epimerisation might occur simultaneously through a common reaction mechanism (Scheme 1). This is chemically feasible[13] and has been observed for the reduction of enoylpyruvate by UDP-N- [14] acetylenolpyruvyl glucosamine reductase. As proposed by Scheme 2. The enolate in the KR active site as proposed by Keatinge-Clay Keatinge-Clay and Stroud in their recent paper on the crystal and Stroud.[8] structure of a ketoreductase,[8] the enolate can be stabilised in the active site (Scheme 2). Attack on the Z enolate double bond by the hydride from the NADPH followed by syn addition try, and needs no additional functionality to that observed in of a proton, possibly by S361, controls both the C2 and C3 the active site (Scheme 2). There is experimental support for stereocentres simultaneously and gives rise to the most com- the idea that epimerisation of the methyl group must precede monly observed syn products (Scheme 1). The direction of hy- the reduction reaction by the KR,[6,12] and it has usually been dride delivery governs whether the centres are 2S,3R or 2R,3S. assumed that the KS is responsible for this. A hybrid module To obtain the less frequently observed anti products there are containing the erythromycin KS1 and KR2 domains produced two possibilities, the more likely being the anti addition of a (2R,3S)-2-methyl-3-hydroxypentanoic acid[15] rather than the proton by a suitably located proton donor after delivery of the 2S,3S isomer that is the predominant product (74%) when an hydride by NADPH. Secondly, there is the possibility of bond isolated KR2 domain is used.[12] This suggests that the stereo- rotation followed by enolisation to give the E enolate, which chemistry of the methyl group is not controlled solely by KS1. upon syn reduction, will give rise to the anti products, as Unfortunately, it is not possible to carry out experiments with shown in Scheme 1. The proposed mechanism of reduction is enantiomerically pure substrates because of rapid racemisa- plausible, as the stereochemistry of both centres is controlled tion.[12] The isolated KR1 domain produced 100% of the ex- simultaneously, thus accounting for the observed syn geome- pected 2S,3R isomer. This is consistent with our model, but

ChemBioChem 2007,8,28–31 2007 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.chembiochem.org 29

18 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

could also be explained by the specificity of the KR1 domain for the 2S isomer. The KR2, KR5 and KR6 do- mains were less specific[12] and produced 2S,3S and 2S,3R products in addition to the expected 2R,3S isomer. This lower specificity might be explained by the larger active-site clefts needed to accommodate the larger natural substrates, which would allow more conformational freedom for the small sub- strates used in the experiments. The sequences of these three KR domains all contain the predicted epi- merase motif (see the Supporting Information); this might explain why more of the unexpected 2S,3S isomer was obtained than the predicted 2R,3S prod- uct. KR3 is a “nonactive” domain that does not per- form ketoreduction; however, it contains the predict- ed epimerase motif, and an epimerisation does indeed occur at this step. However, if the “KR do- mains” do indeed have the double activities of epi- merisation and reduction, then many “nonactive” KR domains might still be important for their epimerisa- tion function. A sequential mechanism could also ex- plain the stereochemical observations, the amino acid sequence alignment of KR domains clearly shows an H residue in the active-site cleft that is con- Scheme 3. Sequential mechanisms to control methyl and alcohol stereochemistry. served in all epimerising domains, with the exception of the KR from module 6 of the rapamycin producing Acknowledgements PKS (Supporting Information). This residue is in an ideal geo- metric position to extract a proton from the diketide, with The authors would like to thank Prof. K. Reynolds, Portland State proton addition by Y374 from the opposing face of the eno- University, OR and Dr. P. Caffrey, University College, Dublin, ire late. The next step is stereoselective reduction by NADPH, with + À for their critical comments. H being delivered to the developing O at C3 by S361 (Scheme 3). Interestingly, the KR of module 6 from the rapamy- cin-producing PKS, predicted as epimerising,[16] was found to Keywords: enzymes · epimerization · erythromycin · be nonepimerising by examining the crystal structure of rapa- polyketides · stereochemistry mycin[17] from PDB ID: 1C9H (B. Milne, personal communica- tion). In this KR, the conserved H312 is replaced by Q. The ra- pamycin KR domain in module 3, which has high overall ho- mology to KR domain in module 6, does epimerise and differs [1] S. Donadio, M. Sosio, Comb. Chem. High Throughput Screening 2003, 6, from KR6 by the presence of H (Supporting Information). Site- 489– 500. [2] D. Hranueli, J. Cullum, B. Basrak, P. Goldstein, P. F. Long, Curr. Med. directed mutagenesis for the targeted alteration of stereospe- Chem. 2005, 12, 1697 –1704. cificity would identify the importance of these residues in cat- [3] J. Staunton, K. J. Weissman, Nat. Prod. Rep. 2001, 18, 380 –416. alysis and unambiguously assign a mechanism underpinning [4] K. J. Weissman, M. Timoney, M. Bycroft, P. Grice, U. Hanefeld, J. Staun- the molecular basis of Celmer’s rules. Understanding the con- ton, P. F. Leadlay, Biochemistry 1997, 36, 13849 –13855. [5] I. Holzbaur, R. Harris, M. Bycroft, J. Corts, C. Bisang, J. Staunton, B. trol of stereochemistry will be valuable in the design of hybrid Rudd, P. F. Leadlay, Chem. Biol. 1999, 6, 189– 195. PKSs for future rational compound design. [6] I. Holzbaur, A. Ranganathan, I. Thomas, D. Kearney, J. Reather, B. Rudd, J. Staunton, P. F. Leadlay, Chem. Biol. 2001, 8, 329–340. [7] P. Caffrey, ChemBioChem 2003, 4, 649– 662. [8] A. T. Keatinge-Clay, R. M. Stroud, Structure 2006, 14, 737 –748. Experimental Section [9] R. Pieper, S. Ebert-Khosla, D. Cane, C. Khosla, Biochemistry 1996, 35, 2054 –2060. For profile analysis, the current version of HMMER from http:// [10] R. Reid, M. Piagentini, E. Rodriguez, G. Ashley, N. Viswanathan, J. Carney, hmmer.wustl.edu/ and current release of the Pfam database D. V. Santi, C. R. Hutchinson, R. McDaniel, Biochemistry 2003, 42, 72– 79. (http://www.sanger.ac.uk/Software/Pfam/) were used. All three [11] C. A. Del Carpio, Y. Takahashi, S. Sasaki, J. Mol. Graphics 1993, 11, 23– 42. DEBS protein sequences were extracted from GenBank and [12] A. Siskos, A. Baerga-Ortiz, S. Bali, V. Stein, H. Mamdani, D. Spiteller, B. Popovic, J. Spencer, J. Staunton, K. Weissman, Chem. Biol. 2005, 12, searched against the entire Pfam database locally. The Secondary 1145–1153. Structure Matching server at EBI (http://www.ebi.ac.uk/msd-srv/ [13] J. D. Rozell, Tetrahedron Lett. 1982, 23, 1767– 1770. ssm/) was used as a tool for overlapping known crystal structures. [14] T. E. Benson, C. T. Walsh, J. M. Hogle, Structure 1996, 4, 47– 54. Multiple alignments of A- and B-type KR sequences were construct- [15] I. Bçhm, I. E. Holzbaur, U. Hanefeld, J. Corts, J. Staunton, P. F. Leadlay, ed by using ClustalW (http://www.ebi.ac.uk). Chem. Biol. 1998, 5, 407 –412.

30 www.chembiochem.org 2007 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim ChemBioChem 2007,8,28–31

19 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

[16] T. Schwecke, J. F. Aparicio, I. Molnar, A. Konig, L. E. Khaw, S. F. Haycock, [17] F. V. Ritacco, E. I. Graziani, M. Y. Summers, T. M. Zabriskie, K. Yu, V. S. M. Oliynyk, P. Caffrey, J. Corts, J. B. Lester, G. A. Bohm, J. Staunton, P. F. Bernan, G. T. Carter, M. Greenstein, Appl. Environ. Microbiol. 2005, 71, Leadlay, Proc. Natl. Acad. Sci. USA 1995, 92, 7839 –7843. 1971 –1976.

Received: September 28, 2006 Published online on November 29, 2006

ChemBioChem 2007,8,28–31 2007 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim www.chembiochem.org 31

20 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

2.2 ClustScan: An integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures

Antonio Starcevic, Jurica Zucko, Jurica Simunkovic, Paul F. Long, John Cullum and Daslav Hranueli. Nucleic Acids Res., 36, 6882-6892, 2008.

Abstract:

The program package ClustScan is designed for rapid, semi-automatic, annotation of DNA sequences encoding modular biosynthetic enzymes including polyketide synthases (PKS), non- ribosomal peptide synthetases (NRPS) and hybrid (PKS/NRPS) enzymes. The program displays the predicted chemical structures of products as well as allowing export of the structures in a standard format for analyses with other programs. Recent advances in understanding of enzyme function are incorporated to make knowledge-based predictions about the stereochemistry of products. The program structure allows easy incorporation of additional knowledge about domain specificities and function. The results of analyses are presented to the user in a graphical interface, which also allows easy editing of the predictions to incorporate user experience. The versatility of this program package has been demonstrated by annotating biochemical pathways in microbial, invertebrate animal and metagenomic datasets. The speed and convenience of the package allows the annotation of all PKS and NRPS clusters in a complete Actinobacteria genome in 2-3 man/hours. The open architecture of ClustScan allows easy integration with other programs, facilitating further analyses of results, which is useful for a broad range of researchers in the chemical and biological sciences.

Own contribution to the paper:

Initial work showed the feasibility of using HMM-profiles to identify domains and specific diagnostic amino acids in domains. The algorithm for predicting the activity and stereospecificity of KR domains was developed. The XML-description that underlies the whole program was defined. The algorithm for producing the chemical structures in a SMILES format (which includes an XML-description of the chemistry was developed. The major supervision of the programmers work was also undertaken.

21 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009 6882–6892 Nucleic Acids Research, 2008, Vol. 36, No. 21 Published online 31 October 2008 doi:10.1093/nar/gkn685

ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures Antonio Starcevic1,2, Jurica Zucko2,3, Jurica Simunkovic3, Paul F. Long4, John Cullum2 and Daslav Hranueli1,*

1Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia, 2Department of Genetics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany, 3Novalis Ltd, Bozˇidara Adzˇije 17, 10000 Zagreb, Croatia and 4The School of Pharmacy, University of London, 29/39 Brunswick Square, London WC1N 1AX, UK

Received June 19, 2008; Revised September 23, 2008; Accepted September 24, 2008

ABSTRACT INTRODUCTION The program package ‘ClustScan’(Cluster Scanner) Bioprospecting for lead compounds from nature continues is designed for rapid, semi-automatic, annotation of to be a corner stone in drug development. As well as iso- DNA sequences encoding modular biosynthetic lating microorganisms from unique environments or bio- enzymes including polyketide synthases (PKS), logical diversity ‘hotspots’, approaches are also being > non-ribosomal peptide synthetases (NRPS) and developed to exploit the chemical diversity from 98% of uncultivable microbes living in the natural environ- hybrid (PKS/NRPS) enzymes. The program displays ment. There is now unprecedented opportunity to access the predicted chemical structures of products the natural diversity of small molecules made by such as well as allowing export of the structures in a microbes by the isolation of metagenomic DNA and het- standard format for analyses with other programs. erologous expression of biosynthetic pathways in a fer- Recent advances in understanding of enzyme func- mentable host. Discovery of novel biosynthetic gene tion are incorporated to make knowledge-based clusters is the first goal of this culture-independent predictions about the stereochemistry of products. research that requires the application of molecular bioin- The program structure allows easy incorporation of formatics to identify DNA sequences of interest. We have additional knowledge about domain specificities developed an integrated set of computer programs for this and function. The results of analyses are presented task, which we call the ‘ClustScan’(Cluster Scanner) pro- to the user in a graphical interface, which also gram package. allows easy editing of the predictions to incorporate Many important secondary metabolites in bacteria are user experience. The versatility of this program synthesized on enzymes encoded by modular biosynthetic package has been demonstrated by annotating bio- gene clusters: polyketide synthase (PKS) clusters, non- chemical pathways in microbial, invertebrate animal ribosomal peptide synthetase (NRPS) clusters, NRPS- independent siderophore (NIS) synthetase clusters and metagenomic datasets. The speed and conve- or hybrid clusters (1–4). These secondary metabolites nience of the package allows the annotation include polyketide antibiotics (e.g. erythromycin), of all PKS and NRPS clusters in a complete immuno-suppressants (e.g. rapamycin) and antiparasitics Actinobacteria genome in 2–3 man hours. The (e.g. avermectin) as well as peptide antibiotics (e.g. vanco- open architecture of ClustScan allows easy integra- mycin), immuno-suppressants (e.g. cyclosporin) and her- tion with other programs, facilitating further ana- bicides (e.g. bialaphos). Correlation of the chemical lyses of results, which is useful for a broad range structures of the products with cluster DNA sequences of researchers in the chemical and biological shows that, in most cases, a defined series of catalytic sciences. domains that can be grouped into modules are responsible

*To whom correspondence should be addressed. Tel: +385 1 4605013, Fax: +385 1 4836083; Email: [email protected]

ß 2008 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. 22 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Nucleic Acids Research, 2008, Vol. 36, No. 21 6883 for each round of chain elongation. Thus, synthesis fol- domains and the necessity of integrating data from lows a co-linear principle in which the gene sequences of many sources. Several tools have been developed to the individual modules determine the chemical outcomes assist this process. Identifying domains poses few pro- of successive chain extension reactions. Large-scale DNA blems as the sequences are well conserved. A much more sequencing has revealed many gene clusters, whose pro- difficult problem is predicting the activity and specificity of ducts are not known (5–7). Predictions about the struc- domains. The NRPS–PKS database (18, http://www.nii. tures of the products based on the DNA sequences res.in/nrps-pks.html), holds data on PKS and NRPS encoding enzyme modules can help decisions about gene clusters including module and domain structure which products may be interesting in the search for and chemical structures of the biosynthetic products. It novel drugs. Modules are composed of domains that allows users to input protein sequences to be used in carry out the different reactions so that prediction of BLAST (19) searches to identify domains and finds module specificity can be built up from that of domain the closest sequences in the database. This allows predic- specificity. In PKSs, each module usually contains an tion of whether an AT-domain uses malonyl–CoA or acyl carrier protein (ACP) domain and an acyl transferase methylmalonyl–CoA as a substrate (i.e. whether a C2 (AT) domain, which is responsible for substrate selection or C3 unit is incorporated into the polyketide). The and transferring the substrate to the ACP domain. For all ASMPKS database (20, http://gate.smallsoft.co.kr:8008/ modules except possibly the starter module (‘loading %7Ehstae/asmpks/index.html) uses a similar methodol- domain’) there is also a ketosynthase (KS) domain ogy, but integrates it with a graphical display of the that performs condensation. Some AT domains select a domains in genes so that modules can be easily recognized. malonyl–CoA substrate which results in a two carbon It also allows the display of a predicted linear polyketide extension. However, other substrates can be used chain product for which the user has to select starter and (e.g. methylmalonyl–CoA, ethylmalonyl–CoA, methoxy- extender units from lists. Minowa et al. (21) used an malonyl–CoA) which result in the incorporation of more approach based on the creation of hidden Markov carbon atoms. However, the backbone chain is always model profiles (22) to predict substrate specificity of AT extended by two carbon atoms and the other carbon domains. The company ECOPIA has also developed a atoms occur as side chains (e.g. methyl groups). Amino software tool (23) DecipherITTM, which helps annotation acid residues in AT domains that differ between malonyl– of new gene clusters based on comparison with a database CoA-incorporating and methylmalonyl–CoA-incorporat- of known clusters. Although these approaches are useful, ing have been identified from multiple alignments of AT they do not make predictions about the stereochemistry of sequences (8–12). There may be further reduction domains the products, which is extremely important for assessing that carry out a sequential reduction of the introduced their promise. As these analyses are essentially based on keto group: ketoreductase (KR) produces a hydroxyl similarity to known clusters rather than identification of group, which may be acted on by a dehydratase domain functional residues, they are less effective for clusters from (DH) to produce a double bond that can be modified to a novel organism groups. Another practical limitation is completely reduced product by an enoyl reductase (ER) that they do not export information about chemical struc- domain. The stereochemistry of the addition step is also tures in a format that can be used by standard programs important. This can arise when the KR domain introduces for further analyses. a hydroxyl group and comparison of the sequences of KR In this paper, we describe a program that utilizes recent domains introducing different stereochemistry identified advances in understanding the function of KR domains to specific residues correlated with this difference (13,14). A make knowledge-based predictions of activity and stereo- second source of differential stereochemistry is the incor- chemical specificity for hydroxyl groups and a-carbon poration of an extender unit with more than two carbon atoms. This is combined with a fingerprint approach to atoms resulting in a side chain with a choice of stereo- predict specificity of AT domains and more conventional chemistry. At one time it was assumed that the KS approaches for prediction of activity of DH and ER domain was responsible for this choice. However, bioin- domains. The program predicts the chemical structures formatic analyses could find no amino acid differences in of products, which can be exported in a SMILES/ the KS domain correlating with the stereochemical out- SMARTS format for further analysis by standard come and instead found correlations with the sequence of Chemistry programs. The program is structured so that the KR domain (15). Studies of the 3D structure of KR it can easily be updated to incorporate new knowledge domains provided mechanistic explanations of how the about the specificity of domains. It has a convenient gra- stereochemistry of the hydroxyl group and the a-carbon phical interface that allows the rapid semi-automatic atom are controlled (16,17). The chirality of the a-carbon annotation of gene clusters encoding modular biosynthetic is lost if reduction of the hydroxyl group to a double bond enzymes by non-expert users. on the b-carbon occurs, but this reduction may result in a new stereochemical choice between the cis- and trans- MATERIALS AND METHODS isomers that is probably determined by the DH domain carrying out the reduction. A new chirality may be created GeneMark (24) (version 2.5; http://opal.biology.gatech. if full reduction occurs and is likely to be determined by edu/GeneMark/) or Glimmer (25) (version 3.02; http:// the ER domain responsible for the final reduction step. www.cbcb.umd.edu/software/glimmer/) were used to The annotation of the DNA sequence of a PKS cluster identify genes. HMMER (22) (version 2.3.2; http:// can be time-consuming because of the large number of hmmer.janelia.org/) was used for identification of

23 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

6884 Nucleic Acids Research, 2008, Vol. 36, No. 21 protein domains. Profiles from Pfam (26) as well as spe- find and annotate shikimic acid pathway genes; see cially constructed profiles were used. The gene prediction Performance of ClustScan subsection). Independently of and protein domain prediction programs run on a Linux the search for protein patterns, the DNA can be analyzed server and each user has a password to allow access to to find probable coding regions using GeneMark (24) or their own workspace. All user activities are performed Glimmer (25). GeneMark provides a library of models via the Java client, which was written in Windows, based on different bacteria and the appropriate model is MacIntosh and Linux versions. chosen using a species related to the source of the DNA. To predict the specificity of AT or KR domains the Glimmer can construct a model for coding regions using amino acid sequence was aligned with an appropriate long open reading frames (ORF) in the input sequence as HMMER profile and the diagnostic amino acid residues training data. This is less effective for short input extracted (Supplementary Data 1 Tables 2S and 3S). sequences. Also sequences with high G+C-content have The diagnostic residues were compared to fingerprints cor- long non-coding random ORFs, which may reduce the responding to the different specificities (substrate specifi- accuracy of coding sequence prediction. The program, city for AT; activity and stereochemistry for KR). The therefore, also allows the user to create a model by supply- prediction of activity/inactivity of DH domains used a ing appropriate training data (e.g. the genome sequence of HMMER-profile based on active actinomycete domains. a related species) and the model can be stored by the user The prediction was based on the HMMER score. ER for future analyses. domains were detected using a profile based on a mixture The results of the analyses are presented both as lists in of active and inactive domains. the ‘workspace’ window (Figure 1A) and graphically in To predict chemical structures, a table was constructed the ‘annotation’ editor window (Figure 1B). The work- (see Supplementary Data 1 Table 4S) that contained dif- space window shows the results in a tree format in ferent chemical building blocks written as isomeric which branches can be opened up or collapsed to show SMILES (27). These were ordered on the basis of substrate the genes and the protein domains. This is useful for and degree of reduction. In cases, where stereochemical obtaining an overview and it is possible to navigate prediction was not possible non-isomeric SMILES through the thousands of genes present in a complete were used. Generic units as SMARTS (http://www.day genome. The graphical ‘annotation’ editor window light. com/dayhtml/doc/theory/theory.smarts.html) were (Figure 1B) shows the positions of genes and protein also included for cases where prediction was not possible. domains on the six reading frames and can be viewed at The predictions were used to generate a description of the different resolutions using a zoom function. It is possible product in an XML format (http://www.w3.org/TR/xml/) using the mouse to displace genes and domains above organized in a hierarchical structure corresponding to and below the reading frames for better visualization of module and domain architecture. This XML description overlapping regions. It is usual to keep both the work- was used to generate the chemical structure from space window and the annotation editor window open the table of SMILES. This description was also used to and clicking on a feature in either, marks the correspond- generate a ring structure from the linear polyketide using ing feature in the other window. The protein domains a simple cyclization rule. The SMILES description can are identified by HMMER analysis using a cut-off score be drawn and displayed in ClustScan using Jmol that can be set to a stringent or relaxed value. This results v. 11.2.14, 2006 (Jmol; http://www.jmol.org/) or exported. in some putative protein domains, which may not be gen- Clustscan can be obtained by request from Novalis Ltd uine. The user can choose to reject a protein domain so ([email protected]). that it is removed from the analysis; the program tracks editing changes so that they can later be reversed if mis- taken deletions occur. In many cases, the decision about RESULTS the protein domain is taken on the basis of whether it occurs at an appropriate position with respect to other The analyses of the DNA sequence data are carried out on domains, which is easily seen in the graphical view of a server and the results are cached so that each analysis the annotation editor. To help the decision, the evidence only needs to be carried out once. This is important as the for the identification of the protein domains can be viewed analysis of a whole Streptomyces genome may take several using the ‘details’ window (Figure 2). This shows the coor- hours, but this can occur unsupervised overnight. The user dinates of the protein domain in the DNA and protein accesses the results using a Java client that gives user- and the scores and E-values from the HMMER analysis. friendly presentation of the data. There is a password- In addition, the alignment of the protein with the profile is protected workspace for each user on the server. The shown. A prediction of the specificity of the protein client allows the user to upload DNA sequences to the domain is also shown. For AT domains (Figure 2A) this server and initiate analyses. The sequence is automatically is the starter unit incorporated by the condensation reac- translated in all six reading frames to allow HMMER (22) tion. For KR domains (Figure 2B) it is predicted whether searches using a library of protein family profiles. The the domain is active for reduction and, in addition, the standard libraries contain PKS and/or NRPS domains, stereochemistry of the hydroxyl group and the a-carbon but it is possible to add other profiles if desired that atom are predicted. The predictions can be overridden if makes the program package generic. These can be profiles the user has extra information in conflict with from the Pfam (26) database or custom profiles created the program’s prediction. For instance, Figure 2A shows with the HMMER package (e.g. we have used profiles to the (correct) prediction of propionyl as the starter unit

24 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Nucleic Acids Research, 2008, Vol. 36, No. 21 6885

Figure 1. (A) The workspace window gives an overview of the analysis in the form of collapsible trees. Detected genes and protein domains are shown. (B) The annotation editor window shows the location of genes (in red) and protein domains (in blue). In this case there are three genes on the three different forward open reading frames. The genes have been displaced from the reading frames by the user to allow better visualization of the domains. The annotation editor has been used for user definition of modules (shown as red curves below the open reading frames). (C) The cluster editor window. The user can define a set of contiguous genes as a cluster. The cluster editor window shows the genes in a cartoon form with an expanded view of the selected gene showing protein domains. Domains can be linked together to give modules. The modules are given identifying names and the program suggests a biosynthetic order that can be accepted or altered by the user. for erythromycin. By clicking on the propionyl, a The complete annotation by ClustScan can be stored as drop-down list is shown that enables selection of an alter- a file in an XML format so that it can be reimported into native unit. ClustScan. The hierarchical nature of XML makes it well On the basis of the results in the annotation editor, suited for representing clusters in terms of genes, modules the user can define a gene cluster covering a region of and protein domains. We developed an XML format that adjacent genes. The annotation of the gene cluster is car- includes information about the biosynthetic order. ried out using the ‘cluster’ editor window (Figure 1C), Although the XML format is primarily designed for the which shows the genes of the gene cluster in a simple internal use of ClustScan, it makes it easy for other appli- cartoon form, hence the term semi-automatic. When a cations to read or write ClustScan compatible files by gene is selected, the protein domains are also shown. adding an appropriate XML parser. In addition to the Modules can be assembled by marking protein domains XML format, annotations can be exported as an EMBL and each module created is given a name. The program or GenBank file for use in other applications or for sub- suggests a biosynthetic order of the genes of a cluster. mission to databases; this results in loss of information on For PKS clusters this is based on identifying a potential biosynthetic order. In addition, the DNA or amino acid loading domain (i.e. typically a module containing only sequences of genes, domains or modules can be copied to AT and ACP domains; Figure 1C) and looking for a the clipboard for further analyses with other programs. thioesterase domain as identifying the last module. If The prediction functions for the activity and specificity there is ambiguity, it is assumed that the genes are used of protein domains are used to deduce module specificity in the pathway in the same order as they occur in the and, thus, to predict the chemical structure of the linear DNA. This procedure identifies the correct biosynthetic polyketide chain product of the gene cluster. The struc- order in most natural gene clusters. The user can alter tures are represented internally in the program as isomeric the suggested order to incorporate any additional knowl- SMILES (27), which can be copied to the clipboard edge available. (Figure 3A) allowing export for use with standard

25 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

6886 Nucleic Acids Research, 2008, Vol. 36, No. 21

A SMILES: [C@H](C)[C@@H](O)[C@@H](C) [C@H](O)[C@H](C)C(=O)C(C)C [C@@H](C)[C@H](O)[C@@H](C) [C@H](O)C(C)C(=O)S

B

C

Figure 3. The molecules window. (A) The SMILES description for the linear backbone of erythromycin predicted from the DNA sequence of the cluster. The SMILES description can be copied to the clipboard for Figure 2. The details window allows the user to examine the evidence export. (B) The 3D structure of the predicted linear chain is shown. The for assignment of protein domains. The HMMER scores and E-values mouse can be used to rotate the molecule. (C) The ring structure of the as well as the alignment are displayed. The predictions of activity erythromycin aglycone as predicted using the cyclization function of the and specificity are also displayed and can be modified by the user. program. (A) The loading AT domain of the erythromycin cluster. The pro- gram makes the correct prediction of a propionyl starter unit. By click- ing on this choice, a selection window has been opened that allows chemical software. The user can define new module speci- the user to override the automatic prediction and select an alter- ficities and provide isomeric SMILES descriptions of the native choice. (B) The KR domain of module 3 of the erythromycin extender units. It is possible to edit the prediction of cluster. module specificity to allow incorporation of such novel

26 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Nucleic Acids Research, 2008, Vol. 36, No. 21 6887 extender units. The program allows the user to display the stereochemistry. In one case, the program predicted the chemical structure of products in a 3D ‘molecules’ window incorrect side chain stereochemistry (A1 instead of A2). (Figure 3B) in which the molecule can be rotated. The In the other four cases, the alignment with the profile did program can also produce a potential cyclic structure not yield an amino acid fingerprint that fell into any of the from a linear molecule (Figure 3C). It is assumed that groups: in these cases the program indicates that no pre- the first hydroxyl or amino group introduced during diction is possible. Thus, the KR prediction was correct in synthesis reacts with the terminal extender unit. 88% of the cases, incorrect in 4% of the cases and the The program is designed to allow easy incorporation of program was unable to provide a prediction in 8% of new knowledge. New or modified prediction of enzyme the cases. activity or specificity can be implemented without chan- Unlike the case of KR, structural information about AT ging program structure. It is also possible for sophisticated domains is not sufficient to help in substrate prediction. users to write their own specific scripts to introduce spe- The most common extender substrates are malonyl–CoA cialized prediction functions. and methylmalonyl–CoA. Comparison by eye of align- ments of AT domain sequences identified 13 amino acid Prediction of domain activities residues, which differed significantly between domains incorporating the two substrates (8–12). The amino acid The presence of any of the seven domains KS, AT, ER, sequences of nine AT extender domains that incorporated DH, KR, ACP or TE is detected using the HMMER ethylmalonyl–CoA were examined. It was found that the profiles. An extender module needs at least KS, AT and 13 amino acid residues had a common pattern that dif- ACP. AT determines the substrate selection for the exten- fered from those of the malonyl–CoA and methylmalo- sion reaction. The three reduction domains (KR, DH and nyl–CoA-specific AT domains. This information was ER) may be absent or present as active or inactive used for prediction of specificity in the program. A further domains. ClustScan predicts whether domains are active known extender substrate is methoxymalonyl–CoA and as well as predicting substrate specificity or stereochemical specific residues associated with choice of this substrate outcome when several outcomes are possible (see were identified in AT domains of the concanamycin Supplementary Data 1 Table 1S). A cluster (28). Eleven methoxymalonyl-incorporating The KR domain is the best characterized domain AT domains were examined, but the 13 fingerprint resi- in terms of structural determination of differential activ- dues used to characterize the other substrates did not ities. Active KR domains determine the chirality of the show a conserved pattern. It was noticed that most had hydroxyl product and bioinformatic analysis identified insertions with respect to the conserved alignment of all amino acid residues involved in this choice (13,14). AT domains, which caused problems in identifying poten- Bioinformatics also suggested that KR rather than KS tial fingerprinting residues. After using a specific align- determined the stereochemistry of b-carbon groups, ment for methoxymalonyl–CoA-incorporating AT when C3 or C4 units are incorporated (15). A comparison domains, it was possible to use a modified form of the of 3D structures of two KR domains of different specifi- published pattern (28) to predict methoxymalonyl–CoA city gave more detailed information on amino acid resi- as a substrate. dues involved in determining both hydroxyl and b-carbon The information about AT extender specificity was stereochemistry (17). In ClustScan, alignment with a KR implemented in ClustScan. The amino acid sequence of profile allows identification of all of these critical amino the AT domain was aligned with a general AT-profile to acids (the ‘fingerprint’) and, thus prediction of the pro- identify the 13 diagnostic amino acid residues. These were duct. The fingerprints used are shown in the compared to three fingerprints corresponding to the three Supplementary Data 1 Table 2S. There are six possible substrates. If the amino acids did not fit any of the three products (A1, A2, B1, B2, C1, C2), which correspond to fingerprints, the AT domain was aligned using a profile three possible ketoreduction outcomes (either hydroxyl derived from the 11 methoxymalonyl–CoA AT domains. stereoisomer—A or B, or no reduction C) coupled with This alignment was used to test if one of the characterized two b-carbon chiralities (called 1 and 2). The accuracy of insertions was present. If no match was found, the AT prediction was tested using 49 KR domains for which the domain was assigned to an unknown substrate category. structure of the polyketide product provides information In addition to AT domains in extender modules, there about activity and stereochemistry; if further active reduc- are often AT domains in loading domains. A set of AT tion domains are present, the product does not provide domains that incorporate acetyl starters (nine domains), any information about the stereochemistry of the KR propionyl starters (eight domains) or methylbutyryl step. Ten of the KR domains processed 2-carbon extender starters (three domains) were aligned with the general units so that only hydroxyl stereochemistry was relevant: AT profile and the 13 diagnostic amino acid residues all 10 predictions were correct. Nine of the KR extracted. The fingerprints for acetyl and propionyl star- domains processing 3- or 4-carbon extender units were ters were identical to those for acetyl and propionyl exten- inactive: in eight cases the program predicted that the ders, respectively. The methylbutyryl starters showed a domains were inactive for reduction and also predicted different pattern, which was also used to construct a spe- the correct side chain stereochemistry. In one case the cific fingerprint. This information was incorporated into inactive KR domain was predicted as active. The other ClustScan. When the user accepts the suggested biosyn- 30 KR domains processing 3- or 4-carbon extender units thetic order or defines a different order the loading domain were active. In 25 cases the program predicted the correct is subjected to a special analysis. If an AT domain is

27 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

6888 Nucleic Acids Research, 2008, Vol. 36, No. 21 present, the 13 diagnostic amino acids are extracted and Performance of ClustScan tested for acetyl, propionyl or methylbutyryl fingerprints. There are two main criteria for the usefulness of All fingerprints for the specificity of AT starter and exten- ClustScan: the accuracy of prediction and the speed and der units are shown in Supplementary Data 1 Table 3S. A convenience of annotating large datasets. The accuracy of dataset of 196 known AT domains was analyzed with prediction was tested on two well-known gene clusters: the ClustScan (95 malonyl–CoA, 79 methylmalonyl–CoA, 9 erythromycin gene cluster and the niddamycin gene cluster ethylmalonyl–CoA and 13 methoxymalonyl–CoA. The (GenBank accession numbers AY771999 and AF016585). remaining 25 were propionyl, acetyl, methylbutyryl and For the erythromycin gene cluster, with one exception, all some unusual ones from loading domain ATs). This the protein domains of the six extender modules were gave the correct prediction in 182 cases (93%), the accurately identified and the propionyl starter wrong prediction in 9 cases (5%) and assignment to an (Figure 2A) was also predicted. The only exception was unknown class in 5 cases (3%). that ClustScan was not able to predict the hydroxyl group For DH domains, the prediction should distinguish stereochemistry of the KR domain of module 4; the pre- active and inactive domains. As insufficient structural diction of the hydroxyl stereochemistry is flagged as information was available to make predictions based on unknown. This does not have an effect on the final pre- knowledge of function, it was decided to use a profile diction as an active DH domain forms a double bond. based on active domains to try to predict activity. The However, the active ER domain recreates a chiral center, profile was built using the sequences of 57 active domains which cannot be predicted with the current state of knowl- derived from actinomycetes. The profile was used to edge. This resulted in two possible structures, where the screen the active domains used in its construction as well user can choose the correct chirality to obtain an accurate as an additional 56 active and 46 inactive domains prediction of the chemistry of the linear backbone > (159 total). All domains with a high score ( 300) were (Figure 3A and B). In this case, the cyclization was also < active, whereas all with a low score ( 200) were inactive. predicted correctly (Figure 3C) (see also Supplementary About 80% of the domains with intermediate scores were Data 1 Figure 1S A and B). In the niddamycin gene clus- active, but the scores of inactive domains were distributed ter, the five genes, the loading domain and the seven exten- through the range. These results were used to define a der modules containing 36 catalytically active domains prediction function with three outcomes: active (score were all correctly predicted with the exception that the >300), 80% probability of activity (scores between substrate for module six was predicted as ethylmalonate 200 and 300) and inactive (score < 200). This prediction instead of the correct methoxymalonate. The inactive KR function was tested on 159 domains (113 active and 46 in module 4 responsible for the b-carbon: S stereochemis- inactive). Forty-six domains fell into the intermediate try was predicted. The correct cyclization was also pre- region (36 active, 10 inactive) with a prediction of 80% dicted (see Supplementary Data 1 Figure 2S A and B). probability of activity. Sixty-seven domains were pre- The results with ClustScan were compared with those dicted to be active of which six were in fact inactive from the NRPS–PKS database prediction system corresponding to a 9% false prediction rate. In contrast, (SEARCHPKS), which is the most popular current ana- the prediction of inactivity was less satisfactory: 43 DH lysis tool for PKS clusters (see Supplementary Data 2 domains were predicted to be inactive of which 16 were Figures 1–4). SEARCHPKS (http://www.nii.res.in/nrps- actually active. A closer examination of these false predic- pks.html) requires protein sequences so the amino acid tions showed that only 1 of the 16 was an actinomycete sequences of the genes were extracted with ClustScan sequence, the other 15 being sequences from Gram- and submitted. SEARCHPKS found two extra false negative bacteria. When attention was confined to actino- positive ACP domains in the erythromycin cluster mycete sequences, 13 DH domains were predicted to be (Supplementary Data 2 Figure 1). The first at the end of inactive of which only one was active. the eryAI gene did not affect the prediction as it was an Initially, a similar approach to that used for the DH isolated ACP domain. The second occurred between the domain was attempted with the ER domains. A profile KS and AT domains of module five and resulted in the was constructed using active actinomycete ER domains. program predicting an additional module and making no However, it was found that better prediction was achieved prediction of the chemistry of the two modules generated. with a profile based on a mixture of active and inactive It is not possible to review the data behind the prediction domains. Sixty-six known ER domains were tested. In all or to manually reject the false positives. SEARCHPKS cases the ER domain was detected. The HMMER score found all the other domains successfully, but does not did not prove useful in distinguishing between active and attempt to make predictions of the activity or stereochem- inactive ER domains. However, there were only three istry of the reduction domains. In particular, this results in cases of inactive ER domains in the presence of an the false prediction of an active KR domain in module active DH domain. There were four cases in which an 3 resulting in the prediction of a hydroxyl group rather ER domain was detected, but the DH domain was inac- than the correct keto group. The substrate choice of the tive. The program, therefore, predicts an active ER loading domain was not predicted, but there was correct domain if a domain is found and there are active KR prediction of a C3 unit for five of the six extender mod- and DH domains present. This gives a false prediction ules; no prediction of substrate was possible for module 5. in the 3/66 (5%) cases of an inactive ER domain with For niddamycin (Supplementary Data 2 Figure 2) there an active DH domain. was also a false prediction of an additional ACP domain

28 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Nucleic Acids Research, 2008, Vol. 36, No. 21 6889 in module 2. This results in a false positive prediction of annotation agreed with the published annotation and an additional module and an inability to predict the che- extended it with predictions of domain activity and stereo- mical structures associated with the two ‘modules’. In chemistry of products. ClustScan has been used to anno- module 4, the KR domain was incorrectly predicted as tate DNA sequences from a variety of bacterial species active and the substrate for module 5 could not be pre- including cyanobacteria. dicted. Like ClustScan the wrong substrate for module 6 ClustScan is mainly designed for use with bacterial was predicted. sequences. However, the more general utility of Eight further well-characterized clusters were anno- ClustScan program package was demonstrated by the ana- tated. For the megalomycin, pimaricin and tylactone clus- lysis of lower eukaryote sequences, where intron predic- ters the predicted module activities were in full agreement tion is often difficult. An example is provided by the slime with the published results. For tylactone (Supplementary mould Dictyostelium discoideum which has 45 PKS genes Data 2 Figure 3) SEARCHPKS found all the domains, (29), which were annotated poorly by the standard anno- but is unable to predict activity of reduction domains; tation methods used in the genome project. Using thus, it predicts a chemistry based on an active KR in ClustScan it was possible to use local HMMER profiles module 4, whereas ClustScan correctly identifies the for the protein domains, which are effective in recognizing domain as inactive. Also, the starter unit is not correctly segments of the domains split by introns. When such an predicted. The worst results for ClustScan were obtained analysis is carried out, a PKS gene shows a characteristic with the rifamycin cluster, where the stereochemistry of signature with parts of protein domains in the correct three methyl groups could not be predicted and two order with gaps due to introns in between. The view in of eight DH domains were falsely predicted as active. the annotation editor window allows easy recognition of In comparison, SEARCHPKS falsely predicts five genes and the coordinates of the domain segments help in DH domains and one KR domain as active and does detecting the intron boundaries. not attempt to predict the stereochemistry (see ClustScan is mainly designed for the annotation of gene Supplementary Data 2 Figure 4). For the other four clus- clusters encoding modular biosynthetic enzymes, but it ters (amphotericin, avermectin, nystatin and oleandomy- can also be used for annotating other genes by loading cin) there were fewer errors (data not shown). Six appropriate HMMER profiles. For instance, we have additional domains were identified, which were not pre- used seven profiles to find and annotate shikimic acid sent in the published annotations. Two were TE domains; pathway genes in a marine organism (30). Recently there as the presence of a TE domain does not directly affect the has been intensive activity with metagenomic sequences. structure of the compound, it is likely that previous anno- The source organisms for sequences are not known, but tation work had not searched carefully for these domains. they contain genomes from a number of culturable and The other four new domains were all DH domains with non-culturable microorganisms. The contigs are often significant deletions (a third to a half of the length). They fairly small and the quality of the sequence is sometimes are, thus, predicted as inactive by ClustScan. Although poor. These problems make an analysis using HMMER such partially deleted domains are not important for pre- diction of product structure, they are interesting for stu- local profiles attractive. We used ClustScan to analyze a dies on the evolution of clusters. 200 kb DNA sequence (AACY020563593) from the A major problem with annotations in DNA database J. Craig Venter Institute Global Ocean Sampling (GOS) entries is that they are not uniform, but differ according to Expedition metagenomic dataset (31). This revealed a the person carrying out the annotation. ClustScan helps potential PKS–NRPS hybrid gene cluster of about 50 kb achieve a uniform annotation standard and we have rean- in size (Figure 4). It starts with an NRPS loading module, notated published sequences to achieve a standard defini- followed by three PKS modules and seven NRPS tion of domain boundaries and description of units. modules and ends with an NRPS thioesterase domain. ClustScan has been used to annotate successfully more However, closer examination of the domain distribution than 50 modular gene clusters from a variety of genomes between reading frames reveals several cases where and metagenomes; full details are available on request. domains forming a single module appear to be present The speed and convenience of ClustScan were assessed in different neighboring genes. This is due to three appar- using the genome sequences of Saccharopolyspora ery- ent frameshifts and the anomalous occurrence of a thraea (7) which is 8.2 Mb in size. A graduate student stop codon, which probably arise due to sequencing was able to annotate the PKS and NRPS clusters in errors. Thus, it seems likely that there are three genes 2–3 h of work (the initial analysis using HMMER can rather than the seven genes indicated by both GenMark take several hours of run time on the server, but this and Glimmer analysis. In the case of two of the potential occurs unsupervised overnight). The ClustScan annotation PKS modules, no AT domains are recognized, but there identified genes, modules and protein domains and are unassigned regions in the protein of appropriate sizes included prediction of activity, substrate specificity and and locations for AT domains (Figure 4). Thus, the stereochemical outcome for PKS domains. The published program allows rapid scanning of metagenomic datasets annotation (7) identified genes, modules and protein and makes it easy to identify potential sequencing errors domains and, in addition, the AT domains are assigned and interesting features of clusters. With the growing to malonyl–CoA and methylmalonyl–CoA-incorporating importance of metagenomic data for drug discovery pro- classes. However the stereochemistry and activity of grams ClustScan helps to eliminate a major bottleneck in reduction domains are not annotated. The ClustScan the analysis.

29 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

6890 Nucleic Acids Research, 2008, Vol. 36, No. 21

Figure 4. Annotation editor window showing the analysis of a potential PKS–NRPS hybrid cluster from a marine metagenomic sequence. The following coloring is used: genes (red), PKS protein domains (green) and NRPS protein domains (blue). Although seven genes are shown, the distribution of domains between genes suggest that sequencing errors have occurred. The three boxes indicate the positions of the probable genes. The first gene has one frameshift, the second gene has two frameshifts and the third gene has an anomalous stop codon (ringed in black) in it. The positions where two AT domains would be expected are also ringed (in yelow).

DISCUSSION method similar to that of Minowa et al. (21) based on ClustScan is easy to use and allows rapid annotation of HMMER profiles of critical amino acids to predict AT new gene clusters. This is very important for exploitation specificity. However, this approach gave lower accuracy of prediction than the fingerprint method that we used of the rapidly accumulating data from large-scale DNA subsequently. For both the KR and AT domains, the fre- sequencing projects. The facts that high-quality annota- quency of false prediction was low (4%). It was striking tion with traditional methods is very time consuming that good results were obtained for both Gram-negative and needs a high degree of experience have prevented sequences as well as for the majority of Gram-positive full exploitation of the extensive DNA database to identify actinomycete sequences. This supports the idea that the potentially interesting biosynthetic enzyme clusters. diagnostic residues in AT domains are functional in sub- ClustScan Although is easy to use, it also allows the user strate specificity rather than being evolutionary accidents. to customize the result and override the automatic predic- In contrast, the DH activity prediction, which was based tions. It is also designed to allow easy incorporation of on an actinomycete profile was only efficient for actino- new knowledge to improve predictive power. The server– mycetes. In particular, many active Gram-negative DH client architecture means that such improvements as well domains were predicted to be inactive. This means that as changes to reflect new versions of the standard analysis the profile mismatch is caused by the evolutionary dis- programs are implemented on the server and do not need tance. Although it would be possible in the short term changes in the client programs installed on users’ compu- to improve DH prediction using profiles for specific ters. An important goal in the design of ClustScan was to groups of organisms, the identification of important func- give it an open architecture which would allow easy inte- tional amino acid residues would give predictions less gration with other programs. The definition of an XML dependent on evolutionary distance. In contrast to other format for full gene cluster description allows interchange annotation programs (18,20,21,23,32), ClustScan predicts with other programs by simply adding an appropriate the stereochemistry of products. The dependence on func- XML parser. The export of annotation as EMBL or tional residues in the KR and AT domains makes it espe- GenBank formats and the export of chemical structures cially valuable for novel gene clusters that are not closely as SMILES (27) facilitate further analyses of results gen- related to known gene clusters. Such clusters are especially erated by ClustScan. interesting in the search for novel drugs. We have not Knowledge about PKS protein domains is used to make implemented specificity predictions for NRPS protein predictions about chemical structure. In the case of the domains. However, there is some information available KR domain (14) there is detailed knowledge about protein to allow partial prediction (32). When the prediction structure and the role of the small number of amino acid power is good enough it will be easy to add NRPS pre- residues that control reductase activity and stereospecifi- dictions to ClustScan and predict the chemical structure of city. In the case of AT extender domains 13 amino acid products. We compared the performance of ClustScan to residues that correlate with the choice between malonyl– that of the SEARCHPKS prediction program of the CoA and methylmalonyl–CoA substrates were known NRPS–PKS database (18). This is less convenient to use (8–12). We found that these 13 amino acids could also as the genes must be identified and the deduced protein be used to predict ethylmalonyl–CoA substrate. The sequence input to the program. The output of predicted incorporation of methoxymalonyl–CoA substrates was chemistry is not available in a standard chemical format. correlated with insertions. Initially, we tried to use a SEARCHPKS often predicts additional ACP domains

30 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Nucleic Acids Research, 2008, Vol. 36, No. 21 6891 that prevent accurate prediction of product chemistry. We full utilization of the rapidly increasing DNA sequence also observed that BLAST (19) searches often gave pro- data. Studies on the evolution of secondary metabolite blems in identifying ACP domains, whereas no problems clusters (36) can reveal biological constraints on the struc- were encountered with HMMER (22); this is probably tures that can be attained; such studies are greatly assisted because of the short length of ACP domains. The predic- by the ability to rapidly and accurately annotate new tion of the specificity of extender AT domains is relatively clusters. good; this probably reflects the fact that AT specificity We used a top-down approach based on HMM models correlates well with phylogenetic trees (33) so that, in addi- to annotate gene clusters encoding modular biosynthetic tion to critical functional amino acids, there are other enzymes. We showed that by choice of appropriate pro- amino acids that differ for evolutionary reasons. As the files, ClustScan could also be used for annotating other BLAST program does not weight residues according to primary and secondary metabolic pathways in a variety conservation, it works best when differences at many resi- of microbial and invertebrate organisms. It seems likely dues correlate with activity. SEARCHPKS does not give that extensions of this approach could be useful for more good prediction of loading module specificities. It does not general annotation tasks. attempt to predict activity or stereochemistry of domains. As these predictions involve a small number of critical residues, they could not be effectively implemented using SUPPLEMENTARY DATA a BLAST-based approach. The ASMPKS database (20) Supplementary Data are available at NAR online. could not be meaningfully compared to ClustScan as it’s gene prediction for clusters with high G+C-content was very poor and it requires a DNA input. This is because it ACKNOWLEDGEMENTS uses the Glimmer (25) program to predict genes and builds an HMM-model from input data. In ClustScan, we imple- We would like to thank Janko Diminic and Vedran Rodic mented the use of custom HMM-models to overcome this for programming and Professor Arnold Demain for help- difficulty for subgenomic sequences. As the ASMPKS ful advice and encouragement. J.S. declares a financial implements a similar approach to the NRPS–PKS data- interest related to this work with regard to the potential base, it is likely that similar results would be found if this future commercial marketing of the software and technical problem were overcome. databases. There are at least 15 known starters used by different modular PKSs. In many cases there is no AT domain in the loading domain. Acetyl, propionyl and methylbutyryl FUNDING starters can be loaded by AT domains and it was found Ministry of Science, Education and Sports, Republic of that they could be distinguished using diagnostic amino Croatia (grant 058-0000000-3475 to D.H.); the German acid residues. It was striking that the acetyl and propionyl Academic Exchange Service (DAAD) and the Ministry starter AT domains showed the same patterns as of Science, Education and Sports, Republic of Croatia the malonyl–CoA and methylmalonyl–CoA extender cooperation grant (to D.H. and J.C.); the Leverhulme domains. It is known that in some cases an acetyl starter Trust; Japanese Bio-Industry Association; The School of is derived from decarboxylation of a malonyl–CoA sub- Pharmacy, University of London (to P.F.L.). strate, but in other cases acetyl–CoA is the substrate (34). The fact that the commonest extenders’ AT domains are Conflict of interest statement. J.S. declares a financial inter- closely related to starter AT domains suggests that it est related to this work with regard to the potential future might be possible to evolve new PKS gene clusters from commercial marketing of the software and databases. truncated clusters that have lost the starter module. Most polyketides undergo cyclization. In ClustScan we have implemented a simple rule of cyclization by interac- REFERENCES tion of the first hydroxyl or amino group with the terminal 1. Challis,G.L. (2005) A widely distributed bacterial pathway for group. This applies to many natural polyketides and raises siderophore biosynthesis independent of nonribosomal peptide the hope that a simple rule-based method can make cor- synthetases. Chembiochem, 6, 601–611. rect predictions in many cases. Prediction of cyclization is 2. Finking,R. and Marahiel,M.A. (2004) Biosynthesis of non- ribosomal peptides. Ann. Rev. Microbiol., 58, 453–488. important to obtain the full benefit of product prediction. 3. Hranueli,D., Cullum,J., Basrak,B., Goldstein,P. and Long,P.F. The ability to rapidly acquire knowledge of new gene (2005) Plasticity of the Streptomyces genome - evolution and engi- clusters from their DNA sequence has a variety of impli- neering of new antibiotics. Curr. Med. Chem., 12, 1697–1704. cations in the search for pharmacologically relevant com- 4. Weissman,K.J. and Leadlay,P.F. (2005) Combinatorial biosynthesis pounds. The identification of novel gene clusters with of reduced polyketides. Nat. Rev. Microbiol., 3, 925–936. 5. Bentley,S.D., Chater,K.F., Cerdeno-Tarraga,A.M., Challis,G.L., interesting and unusual product chemistry will direct the Thomson,N.R., James,K.D., Harris,D.E., Quail,M.A., Kieser,H., choice of targets for lead discovery. Another application Harper,D. et al. (2002) Complete genome sequence of the model of the new sequences is to use them to construct new actinomycete Streptomyces coelicolor A3(2). Nature, 417, 141–147. polyketides based on known modules in silico; i.e. use 6. Ikeda,H., Ishikawa,J., Hanamoto,A., Shinose,M., Kikuchi,H., Shiba,T., Sakaki,Y., Hattori,M. and Omura,S. (2003) Complete them as input for a program such as the Biogenerator genome sequence and comparative analysis of the industrial program (35). ClustScan will help eliminate the bottleneck microorganism Streptomyces avermitilis. Nat. Biotechnol., 21, posed by the annotation of DNA sequences and allow the 526–531.

31 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

6892 Nucleic Acids Research, 2008, Vol. 36, No. 21

7. Oliynyk,M., Samborskyy,M., Lester,J.B., Mironenko,T., Scott,N., 23. Zazopoulos,E., Huang,K., Staffa,A., Liu,W., Bachmann,B.O., Dickens,S., Haydock,S.F. and Leadlay,P.F. (2007) Complete Nonaka,K., Ahlert,J., Thorson,J.S., Shen,B. and Farnet,C.M. genome sequence of the erythromycin-producing bacterium (2003) A genomics-guided approach for discovering and expressing Saccharopolyspora erythraea NRRL23338. Nat. Biotechnol., 25, cryptic metabolic pathways. Nat. Biotechnol., 21, 187–190. 447–453. 24. Besemer, J. and Borodovsky, M. (2005) GeneMark: web software 8. Haydock,S.F., Aparicio,J.F., Molnar,I., Schwecke,T., Khaw,L.E., for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Ko¨nig,A., Marsden,A.F., Galloway,I.S., Staunton,J. and Acids Res., 33, Web Server issue W451–454. Leadlay,P.F. (1995) Divergent sequence motifs correlated with the 25. Delcher,A.L., Bratke,K.A., Powers,E.C. and Salzberg,S.L. (2007) substrate specificity of (methyl)malonyl-CoA: acyl carrier protein Identifying bacterial genes and endosymbiont DNA with Glimmer. transacylase domains in modular polyketide synthases. FEBS Lett., Bioinformatics, 23, 673–679. 374, 246–248. 26. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., 9. Lau,J., Fu,H., Cane,D.E. and Khosla,C. (1999) Dissecting the role Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and of acyltransferase domains of modular polyketide synthases in the Sonnhammer,E.L. (2002) The Pfam protein families database. choice and stereochemical fate of extender units. Biochemistry, 38, Nucleic Acids Res., 30, 276–280. 1643–1651. 27. Weininger,D. (1988) SMILES, a chemical language and information 10. Reeves,C.D., Murli,S., Ashley,G.W., Piagentini,M., system. 1. Introduction to methodology and encoding rules. Hutchinson,C.R. and McDaniel,R. (2001) Alteration of the sub- J. Chem. Inf. Comput. Sci., 28, 31–36. strate specificity of a modular polyketide synthase acyltransferase 28. Haydock,S.F., Appleyard,A.N., Mironenko,T., Lester,J., Scott,N. domain through site-specific mutations. Biochemistry, 25, and Leadlay,P.F. (2005) Organization of the biosynthetic gene 15464–15470. cluster for the macrolide concanamycin A in Streptomyces neyaga- 11. Del Vecchio,F., Petkovic,H., Kendrew,S.G., Low,L., Wilkinson,B., waensis ATCC 27449. Microbiology, 151, 3161–3169. Lill,R., Cortes,J., Rudd,B.A.M., Staunton,J. and Leadlay,P.F. 29. Zucko,J., Skunca,N., Curk,T., Zupan,B., Long,P.F., Cullum,J., (2003) Active-site residue, domain and module swaps in modular Kessin,R. and Hranueli,D. (2007) Polyketide synthase genes and polyketide synthases. J. Ind. Microbiol. Biotechnol., 30, 489–494. the natural products potential of Dictyostelium discoideum. 12. Yadav,G., Gokhale,R.S. and Mohanty,D. (2003) Computational Bioinformatics, 23, 2543–2549. approach for prediction of domain organization and substrate 30. Starcevic,A., Akthar,S., Dunlap,W.C., Shick,J.M., Hranueli,D., specificity of modular polyketide synthases. J. Mol. Biol., 328, Cullum,J. and Long,P.F. (2008) Enzymes of the shikimic acid 335–363. pathway encoded in the genome of a basal metazoan, Nematostella 13. Caffrey,P. (2003) Conserved amino acid residues correlating with vectensis, have microbial origins. Proc. Natl Acad. Sci. USA, 105, ketoreductase stereospecificity in modular polyketide synthases. 2533–2537. Chembiochem, 4, 654–657. 31. Rusch,D.B., Halpern,A.L., Sutton,G., Heidelberg,K.B., 14. Reid,R., Piagentini,M., Rodriguez,E., Ashley,G., Viswanathan,N., Williamson,S., Yooseph,S., Wu,D., Eisen,J.A., Hoffman,J.M., Carney,J., Santi,D.V., Hutchinson,C.R. and McDaniel,R. (2003) Remington,K. et al. (2007) The Sorcerer II Global Ocean Sampling A model of structure and catalysis for ketoreductase domains in expedition: northwest Atlantic through Eastern Tropical Pacific. modular polyketide synthases. Biochemistry, 42, 72–79. PLoS Biol., 5, e77. 15. Starcevic,A., Cullum,J., Jaspars,M., Hranueli,D. and Long,P.F. 32. Rausch,C., Weber,T., Kohlbacher,O., Wohlleben,W. and (2007) Predicting the nature and timing of epimerisation on a Huson,D.H. (2005) Specificity prediction of adenylation domains modular polyketide synthase. Chembiochem, 8, 28–31. in nonribosomal peptide synthetases (NRPS) using transductive 16. Castonguay,R., He,W., Chen,A.Y., Khosla,C. and Cane,D.E. support vector machines (TSVMs). Nucleic Acids Res., 33, (2007) Stereospecificity of ketoreductase domains of the 5799–5808. 6-deoxyerythronolide B synthase. J. Am. Chem. Soc., 129, 33. Jenke-Kodama,H., Bo¨rner,T. and Dittmann,E. (2006) Natural 13758–13769. biocombinatorics in the polyketide synthase genes of the 17. Keatinge-Clay,A.T. (2007) A tylosin ketoreductase reveals how actinobacterium Streptomyces avermitilis. PLoS Comput. Biol., 2, chirality is determined in polyketides. Chem. Biol., 14, 898–908. e132. 18. Yadav,G., Gokhale,R.S. and Mohanty,D. (2003) SEARCHPKS: a 34. Long,P.F., Wilkinson,C.J., Bisang,C.P., Cortes,J., Dunster,N., program for detection and analysis of polyketide synthase domains. Oliynyk,M., McCormick,E., McArthur,H., Mendez,C., Salas,J.A. Nucleic Acids Res., 31, 3654–3658. et al. (2002) Engineering specificity of starter unit selection by the 19. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. erythromycin-producing polyketide synthase. Mol. Microbiol., 43, (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. 1215–1225. 20. Tae,H., Kong,E.B. and Park,K. (2007) ASMPKS: an analysis system 35. Zotchev,S.B., Stepanchikova,A.V., Sergeyko,A.P., Sobolev,B.N., for modular polyketide synthases. BMC Bioinformatics, 8, 327. Filimonov,D.A. and Poroikov,V.V. (2006) Rational design of 21. Minowa,Y., Araki,M. and Kanehisa,M. (2007) Comprehensive macrolides by virtual screening of combinatorial libraries generated analysis of distinctive polyketide and nonribosomal peptide struc- through in silico manipulation of polyketide synthases. J. Med. tural motifs encoded in microbial genomes. J. Mol. Biol., 368, Chem., 49, 2077–2087. 1500–1517. 36. Fischbach,M.A., Walsh,C.T. and Clardy,J. (2008) The evolution of 22. Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, gene collectives: How natural selection drives chemical innovation. 755–763. Proc. Natl Acad. Sci. USA, 105, 4601–4608.

32 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

2.3 Spontaneity in the patellamide biosynthetic pathway

Bruce F. Milne, Paul F. Long, Antonio Starcevic, Daslav Hranueli and Marcel Jaspars. Org. Biomol. Chem., 4, 631-638, 2006.

Abstract:

Post-translationally modified ribosomal peptides are unusual natural products and many have potent biological activity The biosynthetic processes involved in their formation have been delineated for some, but the patellamides represent a unique group of these metabolites with a combination of a macrocycle, small heterocycles and D-stereocentres. The genes encoding for the patellamides show very low homology to known biosynthetic genes and there appear to be no explicit genes for the macrocyclisation and epimerisation steps. Using a combination of literature data and large-scale molecular dynamics calculations with explicit solvent, we propose that the macrocyclisation and epimerisation steps are spontaneous and interdependent and a feature of the structure of the linear peptide. Our study suggests the steps in the biosynthetic route are heterocyclisation, macrocyclisation, followed by epimerisation and finally dehydrogenation. This study is presented as testable hypothesis based on literature and theoretical data to be verified by future detailed experimental investigations.

Own contribution to the paper:

The bioinformatics part of the paper (analysis and comparison of DNA sequences) was carried out and the results integrated with the chemical data in extensive discussions with the other authors.

33 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

PAPER www.rsc.org/obc | Organic & Biomolecular Chemistry

Spontaneity in the patellamide biosynthetic pathway

Bruce F. Milne,a Paul F. Long,b Antonio Starcevic,c Daslav Hranuelic and Marcel Jaspars*d

Received 9th November 2005, Accepted 9th December 2005 First published as an Advance Article on the web 9th January 2006 DOI: 10.1039/b515938e

Post-translationally modified ribosomal peptides are unusual natural products and many have potent biological activity. The biosynthetic processes involved in their formation have been delineated for some, but the patellamides represent a unique group of these metabolites with a combination of a macrocycle, small heterocycles and D-stereocentres. The genes encoding for the patellamides show very low homology to known biosynthetic genes and there appear to be no explicit genes for the macrocyclisation and epimerisation steps. Using a combination of literature data and large-scale molecular dynamics calculations with explicit solvent, we propose that the macrocyclisation and epimerisation steps are spontaneous and interdependent and a feature of the structure of the linear peptide. Our study suggests the steps in the biosynthetic route are heterocyclisation, macrocyclisation, followed by epimerisation and finally dehydrogenation. This study is presented as testable hypothesis based on literature and theoretical data to be verified by future detailed experimental investigations.

Introduction of peptides which includes the lantibiotics8 and microcins,4 as they contain a unique combination of heterocycles (cf microcin B17),4 a 9 Marine natural products show promise as candidates in many macrocycle (cf microcin J25) and D stereocentres. In addition, the therapeutic areas, but the issue of a viable economic supply same gene encodes two different products. The gene cluster shows 1 has stalled the development of many. Chemical synthesis is a extremely low sequence homology to genes of known function,6 solution in some cases, but biotechnological approaches may indicating that the pathway might contain novel enzymes. In be advocated in others. The patellamides (Fig. 1) are a family addition, the heterocyclisation of dipeptides (XC, XT, XS) to of highly conserved thiazole and oxazoline containing cyclic give thiazole and oxazoline rings appears to be more tolerant to octapeptides isolated from the Indo-Pacific seasquirt Lissoclinum changes at X10 than that observed in microcin B17, indicating that patella. The patellamides display a variety of biological activities this process might be exploitable for biotechnological applications 2 including cytotoxicity and as selective antagonists for reversing such as the combinatorial biosynthesis of novel bioactive products the multidrug resistant CEM/VLB100 human leukemic cell line and semi-synthetics. 3 towards vinblastine, colchicine and adriamycin treatment. The In brief, the biosynthesis of such post-translationally modified patellamides are part of a family of bioactive thiazole, thiazoline, ribosomal peptides occurs by the tailoring of the pre-propeptide oxazole and oxazoline containing natural products including coding sequence by the post-translational modification machinery, 4 bleomycin, thiostrepton and diazonamide A. The biological which recognises the leader sequence and start/stop sequences activities of this family of can be attributed to the (Scheme 1). The gene cluster sequenced by Schmidt et al. conformational constraints imposed by the heterocycles and (Scheme 1)6 allows ascribing of function to some of the genes, 4 their ability to bind metals or intercalate into DNA. L. patella but raises many questions. In particular, patB, patC and patF contains an as yet uncultured primitive photosynthetic prokaryote show low or no similarity to proteins of known function. Some Prochloron didemni, which is closely related to the cyanobacteria, of the biosynthetic steps can be rationalised using the putative but differs in the fact that it possesses both chlorophylls a and b, function of the other genes (Scheme 1). It is suggested that patD2 5 lacks phycobilins and has plant-like thylakoids. is involved in the heterocyclisation to form the oxazoline and Recent work has indicated that the patellamides are un- thiazoline rings, the latter of which is oxidised to the thiazole precedented examples of post-translationally modified ribosomal by patG1. Cleavage of the mature patE could occur under the 6 7 peptides, produced by the Prochloron didemni symbiont. The influence of patG2 or patA, followed by adenylation by patD1 and patellamides are the most complex examples of this unusual family macrocyclisation. One step that is unclear is the epimerisation of the stereocentres adjacent to the thiazole rings, although synthetic a Centro de Estudos de Qu´ımica Organica,ˆ Fitoqu´ımica e Farmacologia, studies have shown that this type of system can self-organise into Faculdade de Farmacia,´ Universidade do Porto, Rua An´ıbal Cunha 164, 11 4050-047, Porto, Portugal the correct epimer at the thiazoline stage. Macrocyclisation of bSchool of Pharmacy, University of London, 29/39 Brunswick Square, the mature peptide could occur under the influence of one of the London, UK WC1N 1AX genes of unassigned function, or the linear mature octapeptide may c Faculty of Food Technology and Biotechnology, University of Zagreb, self-organise to allow macrocyclisation with simple activation of Pierottijeva 6, 10000, Zagreb, Croatia the C-terminus. In this paper we present literature and theoretical dMarine Natural Products Laboratory, Department of Chemistry, Uni- versity of Aberdeen, Old Aberdeen, Scotland, UK AB24 3UE. E-mail: data that suggests that the epimerisation and macrocyclisation [email protected]; Fax: +44 1224 272 921 processes may be spontaneous and interdependent.

This journal is © The Royal Society of Chemistry 2006 Org. Biomol. Chem., 2006, 4, 631–638 | 631 34 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Fig. 1 Structures of the known patellamides from Lissoclinum patella and their peptide sequences.

Scheme 1 The pat gene cluster, the patE sequence encoding for 3 (upper) and 1 (lower) and suggested biosynthetic pathway for 1. Italic = leader sequence. Bold = start, stop and start/stop cyclisation sequences. Underline = patellamide coding sequences. ‘Start’ and ‘stop’ peptide sequences in the prepeptide indicate the limits between which the tailoring enzymes must act.

Results and discussion oxidase/subtilisin-like protease and the YcaO from YcaO-like family found in the patD adenylation/heterocyclisation protein Repeating essentially the autoannotation of Schmidt et al.,6 a (identified by match to protein families HMM PF00082 and search of the whole Pfam HMM library with each protein HMM PF02624). Other genes (beside patA, patD,andpatG) sequence of seven Prochloron didemni genes did not show the showed no hits at all or scores just above the threshold that presence of additional adenylation, epimerisation, racemisation cannot be considered reliable. For example, patF shows very weak or cyclisation domains other than those shown in the sequence homology to propeptide C1 and SpoVT/AbrB like domain, and description. The highest scores and most relevant results were there might be a nitroreductase in patG, but also with a very weak obtained with peptidase S8 from the subtilase family found in signal (identified by match to protein families HMM PF08127, patA subtilisin-like protein and as a part of patG thiazoline HMM PF04014 and HMM PF00881, respectively). PatB, patC

632 | Org. Biomol. Chem., 2006, 4, 631–638 This journal is © The Royal Society of Chemistry 2006 35 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

and patE did not score above the threshold in Pfam. In addition, incorporation into the patellamides, as the gene cluster suggests when profiles of NRPS adenylation, cyclisation and epimerisation that L-amino acids only are incorporated into the patellamides, domains were generated12 for a more detailed annotation of such that epimerisation must occur at a later stage. Epimerases in individual Prochloron didemni genes,6 no significant hits were non-ribosomal peptide synthetases act by de- and re-protonating detected. the a-carbon to give an equilibrium mixture of both epimers.13 The Further analysis indicates that patA contains another unknown downstream domain then selects the D-epimer for condensation domain. The peptidase S8 is localised in the first half of the to form a peptide bond. In the patellamide biosynthesis the peptide (amino acids 1–254), whereas the second half is still linear peptide is formed first followed subsequently by tailoring uncharacterised. PatD, which is postulated to have adenyla- steps. If a non-ribosomal peptide synthetase epimerase domain tion/heterocyclisation activity, shows a well-defined YcaO-like were present in the patellamide gene cluster this would result domain in the second half of the protein leaving space for another in a mixture of stereoisomers as no subsequent selectivity step domain. Therefore it is likely that patD also contains two domains, is possible. This phenomenon is not observed suggesting that a one of which is totally unknown with no homology to anything different epimerisation process operates. in the database. For patF one can conclude that it contains two The spontaneous epimerisation hypothesis is proposed on the domains because the hits (identified by matches to protein families basis of experiments performed on related L. patella compounds, HMM PF08127 and HMM PF04014) do not overlap and expand the lissoclinamides, which contain thiazolines as part of a cyclic throughout the peptide. One can only speculate about the nature heptapeptide.11 In these, the a-carbon preceding a thiazoline is of these domains because the E-values are simply too weak. PatG extremely labile to epimerisation due to the adjacent imine bond is the largest polypeptide in the cluster and probably contains (Scheme 2). In lissoclinamide 7, it was observed that if the four domains. The only one that can be certainly identified is the incorrect (L) stereochemistry was incorporated into the compound peptidase (peptidase S8 from the subtilase family) as the third at the Val-(thiazoline) a-carbon via chemical synthesis, that under domain (amino acids 522–837). The second domain might be a treatment with pyridine this centre would epimerise to give the dehydrogenase localised around amino acids 276–480, while the naturally observed epimer of lissoclinamide 7 (Scheme 3). first and fourth domains are localised around amino acids 1–270 and 840–1100. We will analyse the biosynthetic pathway (Scheme 1) in detail. There are two possible routes by which heterocyclisation may occur, cyclodehydration followed by oxidation (dehydrogenation) or oxidation followed by cyclodehydration. The former has been observed for the microcin B17 system, whereas the latter is relatively rare.4 In the case of the patellamides heterocycles at different oxidation levels, oxazolines and thiazoles, are incorpo- rated suggesting that cyclodehydration occurs first to give the oxazoline and thiazoline and that the latter is then oxidised to the thiazole by patG2 and/or patA.6 However, in the microcin B17 Scheme 3 Equilibration of unnatural epimer of lissoclinamide 7 to the system,4 thesameenzymes(McbB, C, D) process GC and GS natural epimer at the centre marked with a * occurs in the presence of 5–10 to thiazole and oxazole respectively. Thiazolines are inherently equivalents pyridine at 60 ◦C. more reactive than oxazolines suggesting that the oxidation of thiazoline to thiazole in the patellamide biosynthesis may occur spontaneously, but two other scenarios can be proposed for the In the case of the patellamides, we suggest that this process lack of further oxidation of the oxazoline to oxazole. In the could occur at the thiazoline stage whilst still incorporated into the first, kinetic release occurs before the oxazoline is oxidised to the 71 aa pre-propeptide, prior to oxidation to the thiazole, which fixes oxazole, and in the second it is proposed that the biosynthetic the stereochemistry and subsequent macrocyclisation (Scheme 4, enzymes lack terminal dehydrogenase activity. The presence of Pathway 2). An alternative scenario is that the macrocyclised patel- a putative thiazoline oxidase in patG1 suggests that the kinetic lamides are biosynthesised containing L-thiazolines, which then release scenario may be most likely. It is uncertain at which stage epimerise into the observed (D) geometry followed by enzymic or of the biosynthesis the oxidation of thiazoline to thiazole occurs. spontaneous oxidation (Scheme 4, Pathway 1). Enzymic oxidation The initial sequence analysis6 indicates that in the absence of an of the thiazolines to thiazoles would occur under the influence of epimerase in the pat gene cluster this process may either be under patG1, although spontaneous oxidation of thiazolines to thiazoles the control of a novel enzyme or be spontaneous. It is unlikely to has often been proposed for the lissoclinamides, in which identical 14 be catalysed by a separate racemase producing D-amino acids for compounds are observed differing only in oxidation states. An

Scheme 2 Proposed mechanism by which epimerisation at the a-carbon adjacent to the thiazole occurs.

This journal is © The Royal Society of Chemistry 2006 Org. Biomol. Chem., 2006, 4, 631–638 | 633 36 634 | 2006, Chem., Biomol. Org.

Doctoral dissertation. 631–638 4, Starcevic Kaiserslautern, A. In silico University studies 37 of of modular Kaiserslautern, biosynthetic hsjunlis journal This Dept. clusters. of Genetics, © h oa oit fCeity2006 Chemistry of Society Royal The January 2009

Scheme 4 The three pathways (1–3) considered for the spontaneous macrocyclisation/epimerisation hypothesis and subjected to molecular dynamics simulations. Pathway 4 is deemed unlikely on biosynthetic grounds but has been included in the simulations for completeness. Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

intermediate route is shown by pathway 3 in Scheme 4, in which Fig. 2. For clarification, the actual distance measured during these epimerisation occurs first to generate D-thiazolines followed simulations is represented in Fig. 3. by macrocyclisation and oxidation to the thiazoles. A fourth pathway can be proposed (pathway 4 in Scheme 4), in which the epimerisation occurs after the formation of the thiazoles, with macrocyclisation occurring prior or subsequent to this step. Although this pathway has been included in our molecular dynamics simulations, it is felt that this is biosynthetically extremely unlikely. The initial suggestion6 was that epimerisation occurs during the maturation of the pre-propeptide and prior to proteolysis and macrocyclisation (Scheme 4, Pathway 2). Schmidt et al. suggest that the mature propeptide containing the oxazoline and thiazole rings and two D-stereocentres is cleaved from the propeptide under the influence of patG2 or patA (Scheme 1).6 PatD1 could adenylate the linear peptide followed by macrocyclisation, which could be spontaneous, or under enzymic control. Spontaneous macrocyclisation has also been proposed as a possibility for the lantibiotics, related post- translationally modified ribosomal peptides.8 Studies on several lantibiotics using synthetic precursors have shown that sponta- neous macrocyclisation is possible, and that in some cases the naturally observed stereochemistry is obtained.8 We propose here that this process in the patellamides is spontaneous as the linear octapeptide containing heterocycles is capable of pre-organising The simulations for the expected route via pathway 2 (Asc D-Thz itself without outside influence into a conformation suitable in Fig. 2) in which epimerisation is followed by dehydrogenation for macrocyclisation. This effect is due to the conformational and macrocyclisation show that the N and C termini never get constraints induced by the thiazoline or thiazole and oxazoline within reasonable bonding distance. Therefore, if this is the actual rings. This effect of pre-organising of cyclic peptides by including biosynthetic route, although epimerisation may occur sponta- conformational constraints such as heterocycles has been observed neously, the macrocyclisation must be under enzymic control. This in the synthesis of cyclic peptides by Jolliffe and co-workers, process would then differ from macrocyclisation in non-ribosomal and was found to increase macrocyclisation yields up to 99%, peptide synthetases, in which an enzyme-bound peptide chain is even at high substrate concentrations.15 Chemical syntheses of the cleaved by a terminal thioesterase and cyclised under control of 18 patellamides do not utilise this effect because the oxazoline rings the same enzyme. In pathway 2 a free fully tailored Asc D-Thz are formed last.16 chain would be expected to complex to an enzyme which would For these reasons we decided to simulate the behaviour of the activate it and enforce a conformation in which macrocyclisation putative linear peptides represented in Scheme 4 pathways 1– is possible. 4 using molecular dynamics simulations to determine which of Similarly, pathway 3 (Asc D-Thn in Fig. 2) in which epimerisa- these pathways is the most probable. We have investigated the tion is followed by macrocyclisation and dehydrogenation also conformational preferences of the patellamides previously under suggests that macrocycle closure will rarely occur. This then different solvent conditions using a mixture of circular dichroism, suggests that the epimerisation step must occur as a later step NOE-restrained molecular dynamics and unrestrained molecular in the biosynthetic process, as both pathways 1 and 4 (Asc L-Thn 17 dynamics calculations. and Asc L-Thz in Fig. 2) show trajectories in which the C and N In this study the four possible linear peptides encompassing termini are frequently within bonding distance. Pathway 4 (Asc the thiazole and thiazoline oxidation states and L and D stere- L-Thz in Fig. 2) was ruled out on biosynthetic grounds above ochemistries at the a-centre adjacent to the thiazole/thiazoline as this would require an epimerisation of a non-labile a-centre were constructed for molecular dynamics in explicit water (Asc adjacent to a thiazole. In the absence of an epimerase domain L-Thn, Asc D-Thn, Asc L-Thz, Asc D-Thz). Ascidiacyclamide (8) and evidence suggesting that spontaneous epimerisation at the a- sidechains were chosen for simplicity and the macrocycle junction centre adjacent to thiazoline is facile this suggests that pathway 4 was created where indicated by the patE coding sequences. is extremely unlikely. The remaining route, pathway 1, (Asc L-Thn Zwitterionic forms present at the expected physiological pH were in Fig. 2) in which macrocyclisation occurs on the linear peptide used, as well as relevant concentrations of NaCl. Adenylate groups with thiazolines and L stereochemistry at the a-centres adjacent were not included, as the eventual distance between the N and C to the thiazolines, followed by spontaneous epimerisation and termini (Fig. 2) would determine whether cyclisation was possible, finally enzymic or spontaneous oxidation of the thiazolines to and not the nature of the activating group. After generating thiazoles is therefore deemed the most likely. The trajectories from starting conformations using simulated annealing followed by the dynamics simulations indicate that the N and C termini are energy minimisation, six simulations of 1000 ps with a 2 fs within bonding distance for a greater proportion of the time. If timestep were executed at 300 K for each of the four linear adenylation of the linear Asc L-Thn occurs under the influence peptides. The distance between the N and C termini were extracted of patD1, then macrocyclisation will occur readily. This pathway from the trajectories for each of these simulations and plotted in is in keeping with the macrocyclisation experiments by Jolliffe

This journal is © The Royal Society of Chemistry 2006 Org. Biomol. Chem., 2006, 4, 631–638 | 635 38 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Fig. 2 Terminal N–C distances monitored during a total of six 1000 ps dynamics simulations for all geometries of ascidiacyclamide considered in this study.

between the two types of heterocycles present. If the process is enzymic at this late stage, then it may be that the oxazolines cannot gain access to the oxidase active site.

Conclusions

The inability of the preferred candidate, Asc D-Thz to close easily suggests that a macrocyclase may be involved, but in the absence of such an enzyme in the sequence analysis this indicates that the enzyme may be very novel, or that the process via Fig. 3 Actual distance measured during the molecular dynamics simula- is spontaneous and occurs a different pathway as suggested tions represented in Fig. 2. by the findings discussed above. Our study is suggestive that pathway 1 in Scheme 4 may be the route via which the patellamides are biosynthesised. A testable hypothesis is therefore proposed mentioned above,15 and Wipf’s observation that a-centres adjacent which can be verified by future detailed biosynthetic studies. to thiazolines in these macrocycles can spontaneously self organise The function of the genes in the patellamide gene cluster will into the lowest energy epimers (Scheme 3).11 This then suggests only be fully understood after these investigations, but it is that the oxidation of thiazoline to thiazole is the final step in the postulated that, in parallel with the lantibiotics8 and microcins,4 biosynthetic process, whether spontaneous or enzymic. The fact several of these might be involved in metabolite export and host that the oxazolines are not oxidised to oxazoles in the patellamides immunity. may be due to the fact that this oxidation occurs at such a late stage Speculation as to the biosynthetic origin of unusual metabolites in the biosynthesis. If the oxidation is spontaneous, then chemical has been used successfully to explain observed stereochemistries reactivity differences may explain the difference in oxidation states and product distributions. The most notable example is the

636 | Org. Biomol. Chem., 2006, 4, 631–638 This journal is © The Royal Society of Chemistry 2006 39 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

postulated electrocyclic reactions leading to the endiandric acids,19 as starting configurations for the production runs. These runs were which were later confirmed by the synthesis of the putative precur- performed at 300 K and lasted for 1000 ps. A 2 fs timestep was used sor which gave identical product distribution and stereochemistry and Berendsen temperature coupling and PME electrostatics were to that observed in nature.20 Similarly, the biogenetic hypothesis employed. The evolution of the distance between the N-terminal towards the manzamine alkaloids21 predicted the existence of the nitrogen and the C-terminal carbon was extracted from the final related xestocyclamine/madangamine alkaloids.22 Speculation on trajectories using the gOpenMol software.39 the biosynthetic origin of alga-derived brominated terpenoids has recently been shown to be correct.23 As can be seen, biogenetic hypotheses of the type in this paper has helped advance our Acknowledgements understanding of biosynthetic processes in the past24 and has 25 MJ and PFL thank the Leverhulme Trust for funding (F/00 assisted in the design of biomimetic syntheses. 152/N). Bioinformatics work was supported by the grant 058008 (to DH) from the Ministry of Science, Education and Sports, Experimental Republic of Croatia. All molecular dynamics calculations were performed under openMosix on the SEERAD funded 64- Sequence analysis node RRI/BioSS Beowulf cluster at the Rowett Research In-

26 stitute, Aberdeen, Scotland with the help of Dr Tony Travis. The entire Pfam HMM library was downloaded from the BFM is the recipient of post-doctoral research fellowship Pfam’s FTP site. The library was searched locally using the 27 SFRH/BPD/17830/2004 from the Portuguese Ministry for Sci- latest HMMER hidden Markov model software. The search was ence, Technology and Higher Education (Foundation for Science done using fastA files containing protein translations of all seven and Technology). individual Prochloron didemni genes6 taken from the GenBank’s sequence AY986476. To confirm the results from the search of the Pfam HMM References library, profiles of all adenylation, cyclisation and epimerisation 28 1 D. J. Newman and G. M. Cragg, J. Nat. Prod., 2004, 67, 1216. domains from NRPS’s were prepared. FastA protein sequences 2 B. M. Degnan, C. J. Hawkins, M. F.Lavin, E. J. McCaffrey, D. L. Parry, 12 of domains were obtained from the NRPSDB database. Multiple A. L. v. d. Brenk and D. J. Watters, J. Med. Chem., 1989, 32, 1349. alignments of domains were made using ClustalW.29 Three profiles 3 A. B. Williams and R. S. Jacobs, Cancer Lett., 1993, 71, 97. were built from multiple alignments of domains by the preparation 4 R. Sinha Roy, A. M. Gehring, J. C. Milne, P.J. Belshaw and C. T. Walsh, Nat. Prod. Rep., 1999, 16, 249. of a small library that was used in search. 5R.LewinandL.Chang,Prochloron: A Microbial Enigma, Chapman and Hall, New York, 1989. 6 E. Schmidt, J. Nelson, D. Rasko, S. Sudek, J. Eisen, M. Haygood and Molecular dynamics simulations J. Ravel, Proc. Natl. Acad. Sci. USA, 2005, 102, 7315. 7 P. Long, W. Dunlap, C. Battershill and M. Jaspars, ChemBioChem, Molecular dynamics (MD) calculations were carried out with 2005, 6, 1760. the GROMACS v3.2.1 package.30,31 The modified GROMOS-87 8 C. Chatterjee, M. Paul, L. Xie and W. A. van der Donk, Chem. Rev., united-atom force field distributed with GROMACS was used 2005, 105, 633. 32–35 9 S. Rebuffat, A. Blond, D. Destoumiex-Garzon, C. Goulard and J. in all simulations. Aqueous solvation was modelled using the Peduzzi, Curr. Protein Pept. Sci., 2004, 5, 383. flexible form of the enhanced simple point charge (SPC/E) water 10 R. S. Roy, P. J. Belshaw and C. T. Walsh, Biochemistry, 1998, 37, 4125. model.36,37 A 125 nm3 cubic water box was used for solvation of 11 P. Wipf, P. C. Fritch, S. J. Geib and A. M. Sefler, J. Am. Chem. Soc., each peptide, requiring ∼4200 waters per peptide. In order to better 1998, 120, 4105. Nucleic Acids + − 12 M. Z. Ansari, G. Yadav, R. S. Gokhale and D. Mohanty, simulate the physiological environment Na and Cl were Res., 2004, 32, W405. addedwiththegenion program (supplied as part of GROMACS) 13 D. B. Stein, U. Linne and M. A. Marahiel, FEBS J., 2005, 272, 4506. so as to produce a concentration of 35 g l−1, with minimum 14 L. A. Morris, B. F. Milne, M. Jaspars, J. J. Kettenes van den Bosch, K. Versluis, A. J. R. Heck, S. M. Kelly and N. C. Price, Tetrahedron, 2001, force calculations being used to determine which of the solvent 57, 3199. molecules were to be replaced. The peptides were modelled in their 15 D. Skropeta, K. A. Jolliffe and P.Turner, J. Org. Chem., 2004, 69, 8804. zwitterionic forms and the PRODRG server at Dundee Univer- 16 Y. Hamada, M. Shibata and T. Shioiri, Tetrahedron Lett., 1985, 26, sity (http://davapc1.bioch.dundee.ac.uk/programs/prodrg/) was 6501. 38 17 B. F. Milne, L. A. Morris, M. Jaspars and G. S. Thompson, J. Chem. used for generation of the required force field topology files. Soc., Perkin Trans. 2, 2002, 1076. In order to generate starting configurations for the simulations 18 H. D. Mootz, D. Scharzer and H. A. Marahiel, ChemBioChem, 2002, each peptide–solvent system was subjected to 6 cycles of simulated 3, 490. J. Chem. Soc., Chem. annealing. The temperature was cycled from 50 K up to 500 K 19 W.Bandanarayake, J. Banfield and D. S. C. Black, Commun., 1980, 902. over 250 ps and back down to 50 K over 50 ps so as to 20 K. Nicolaou, N. Petasis and R. Zipkin, J. Am. Chem. Soc., 1982, 104, allow reasonable sampling of conformational space followed 5560. by quenching of favoured conformations. Temperature coupling 21 J. E. Baldwin and R. C. Whitehead, Tetrahedron Lett., 1992, 33, 2059. 22 J. Rodriguez and P. Crews, Tetrahedron Lett., 1994, 35, 4719. within each system was performed using the Berendsen method 23 A. Butler and J. N. Carter-Franklin, Nat. Prod. Rep., 2004, 21, 180. and electrostatics were treated using the particle–mesh Ewald 24 R. Thomas, Nat. Prod. Rep., 2004, 21, 224. (PME) approach. The time step employed was 2 fs. The resulting 25 K. Nicolaou, T. Montagnon and S. Snyder, Chem. Commun., 2003, trajectories were manually sampled every 300 ps using the ngmx 551. 26 A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S. R. Eddy, GROMACS trajectory viewer. The 6 samples for each derivative S. Griffiths-Jones, K. L. Howe, M. Marshall and E. L. Sonnhammer, were then subjected to energy minimisation and subsequently used Nucleic Acids Res., 2002, 30, 276.

This journal is © The Royal Society of Chemistry 2006 Org. Biomol. Chem., 2006, 4, 631–638 | 637 40 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

27 HMMER-Biological sequence analysis with profile hidden Markov 34 D. van der Spoel, A. R. van Buuren, D. P. Tieleman and H. J. C. models Version 2.3.2, HHMI/Washington University School of Berendsen, J. Biomol. NMR, 1996, 8, 229. Medicine, 1992–2003. 35 W. F. van Gunsteren and H. J. C. Berendsen, GROMOS-87 Manual, 28 S. R. Eddy, Bioinformatics, 1998, 14, 755. 1987. 29 D. Higgins, J. Thompson, T. Gibson, J. D. Thompson, D. G. Higgins 36 H. J. C. Berendsen, J. R. Grigera and T. P. Straatsma, J. Phys. Chem., and T. J. Gibson, Nucleic Acids Res., 1994, 22, 4673. 1987, 91, 6269. 30 H. J. C. Berendsen, D. van der Spoel and R. G. van Drunen, Comput. 37 H. J. C. Berendsen, J. P. M. Postma, W. F. van Gunsteren and J. Phys. Commun., 1995, 91, 43. Hermans, Intermolecular Forces, ed. B. Pullman, Reidel, Dordrecht, 31 E. Lindahl, B. Hess and D. van der Spoel, J. Mol. Model, 2001, 7, 306. 1981. 32 X. Daura, B. Oliva, E. Querol, F. X. Aviles and O. Tapia, Proteins: 38 A. W. Schuettelkopf and D. M. F. van Aalten, Acta Crystallogr., Sect. Struct., Funct., Genet., 1997, 25, 89. D, 2004, 60, 1355. 33 A. R. van Buuren, S. J. Marrink and H. J. C. Berendsen, J. Phys. Chem., 39 D. L. Bergman, L. Laaksonen and A. Laaksonen, J. Mol. Graphics, 1996, 97, 9206. 1997, 15, 301.

638 | Org. Biomol. Chem., 2006, 4, 631–638 This journal is © The Royal Society of Chemistry 2006 41 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

2.4 Enzymes of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins

Antonio Starcevic, Shamima Akthar, Walter C. Dunlap, J. Malcolm Shick, Daslav Hranueli, John Cullum and Paul F. Long. Proc. Natl. Acad. Sci. USA, 105, 2533-2537, 2008.

Abstract:

The shikimic acid pathway is responsible for the biosynthesis of many aromatic compounds by a broad range of organisms including bacteria, fungi, plants and some protozoans. Animals are considered to lack this pathway as evinced by their dietary requirement for shikimate-derived, aromatic amino acids. We challenge the universality of this traditional view in this first report of genes encoding enzymes for the shikimate pathway in an animal, the starlet sea anemone Nematostella vectensis. Molecular evidence establishes, for the first time, horizontal transfer of ancestral genes of the shikimic acid pathway into the N. vectensis genome from both bacterial and eukaryotic (dinoflagellate) donors. Bioinformatic analysis also reveals four genes that are closely related to those of Tenacibaculum sp. MED152, raising speculation for the existence of a previously unsuspected bacterial symbiont. Indeed, the genome of the holobiont (i.e., the entity consisting of the host and its symbionts) comprises a high content of Tenacibaculum-like gene orthologs, including a 16S rRNA sequence that establishes the phylogenetic position of this symbiont to be within the family Flavobacteriaceae. These results provide a complementary view for the biogenesis of shikimate-related metabolites in marine Cnidaria as a “shared metabolic adaptation” between the partners.

Own contribution to the paper:

All the HMMER and BLAST analyses were carried out and the bioinformatic analyses of the undergraduate student Ms. Akthar were supervised.

42 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Enzymes of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins

Antonio Starcevic*†, Shamima Akthar‡, Walter C. Dunlap§, J. Malcolm Shick¶, Daslav Hranueli*, John Cullum†, and Paul F. Long‡ʈ

*Section for Bioinformatics, Department of Biochemical Engineering, Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia; †Department of Genetics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany; ‡School of Pharmacy, University of London, 29/39 Brunswick Square, London WC1N 1AX, United Kingdom; §Australian Institute of Marine Science, PMB 3 Townsville MC, Townsville 4810, Queensland, Australia; and ¶School of Marine Sciences, University of Maine, 5751 Murray Hall, Orono, ME 04469-5751

Edited by Lynn Margulis, University of Massachusetts, Amherst, MA, and approved December 19, 2007 (received for review August 7, 2007) The shikimic acid pathway is responsible for the biosynthesis of shikimate pathway to synthesize ‘‘essential’’ aromatic com- many aromatic compounds by a broad range of organisms, includ- pounds (5, 9, 10). ing bacteria, fungi, plants, and some protozoans. Animals are Fragmentary evidence suggests an exception to this traditional considered to lack this pathway, as evinced by their dietary view. For example, nonsymbiotic scleractinian corals reportedly requirement for shikimate-derived aromatic amino acids. We chal- can synthesize aromatic amino acids (11), and aposymbiotic lenge the universality of this traditional view in this report of genes specimens (temporarily lacking zooxanthellae owing to experi- encoding enzymes for the shikimate pathway in an animal, the mental or special environmental circumstances) of the sea starlet sea anemone Nematostella vectensis. Molecular evidence anemone Aiptasia pulchella can produce other amino acids establishes horizontal transfer of ancestral genes of the shikimic deemed essential for metazoans (albeit not aromatic amino acids acid pathway into the N. vectensis genome from both bacterial and produced in postchorismate ; ref. 12). Aposymbiotic eukaryotic (dinoflagellate) donors. Bioinformatic analysis also re- specimens of the sea anemone Anthopleura elegantissima contain GENETICS veals four genes that are closely related to those of Tenacibaculum the same complement and concentration of MAAs derived from sp. MED152, raising speculation for the existence of a previously the shikimate pathway as found in zooxanthellate conspecifics unsuspected bacterial symbiont. Indeed, the genome of the holo- (9), a situation that is not altered by controlled diets (13). biont (i.e., the entity consisting of the host and its symbionts) Moreover, the zooxanthellae (Symbiodinium muscatinei) freshly comprises a high content of Tenacibaculum-like gene orthologs, isolated from A. elegantissima do not contain MAAs nor do these including a 16S rRNA sequence that establishes the phylogenetic algae produce MAAs in pure culture (14). These facts suggest position of this associate to be within the family Flavobacteriaceae. that neither the zooxanthellae nor the diet are the immediate These results provide a complementary view for the biogenesis of source of MAAs in these sea anemones. Symbiotic specimens of shikimate-related metabolites in marine Cnidaria as a ‘‘shared A. elegantissima may naturally harbor taxonomically different metabolic adaptation’’ between the partners. algae (chlorophytes and dinoflagellates) under different envi- ronmental circumstances, yet animals having different symbi- symbiosis ͉ Tenacibaculum ͉ Cnidaria onts contain the same complement of MAAs (9); it is improb- able that such phylogenetically distant algae would produce exactly the same subset of MAAs from among an identified suite he concentrations of inorganic nutrients and planktonic food Ͼ are often low in certain marine environments, where many of 20 such compounds, so the endosymbionts may not account T for the MAAs in this case, either. Finally, some symbiotic corals invertebrates compensate for such oligotrophic conditions by harbor phylotypes of zooxanthellae that do not produce MAAs adopting phototrophic symbioses with microorganisms, espe- in culture, yet the symbioses do contain MAAs (15, 16), which cially dinoflagellates, other unicellular algae, and photosynthetic again suggests an extra-algal origin of the MAAs or a regulated bacteria. The dinoflagellate endosymbionts of marine inverte- metabolism unique to the holobiont. We have chosen the brates, colloquially known as ‘‘zooxanthellae,’’ typically belong shikimic acid pathway to interrogate the metabolic adaptations to the genus Symbiodinium. Such zooxanthellae are particularly in marine invertebrates, which to date are poorly defined at the common in association with anthozoan cnidarians, especially genetic level (4, 17, 18). corals and sea anemones. In cnidarian photoautotrophic sym- In the absence of genomic data for symbiotic anthozoans and bioses, the animal host receives organic carbon, including car- their zooxanthellae, we have examined Nematostella vectensis, bohydrates, lipids, and amino acids, from its endosymbionts, the burrowing ‘‘starlet’’ sea anemone found in estuaries and salt whereas the photobionts themselves are fertilized by the recy- marshes along the North American Atlantic and Pacific coasts cling of essential nitrogen, phosphorus, and sulfur wastes pro- and on the southeast coast of the United Kingdom, and which duced from catabolism in the host and by respiratory CO2 to has become a model system in developmental biology (19). This support algal (1–3). Symbiotic interactions go far basal metazoan was the first to have its genome sequenced (20), beyond the cycling of carbon and nutrients and must also include other metabolic adaptations for survival in marine ecosystems (4). For example, UV-absorbing mycosporine-like amino acids Author contributions: W.C.D., J.M.S., D.H., J.C., and P.F.L. designed research; A.S., S.A., (MAAs) that protect against the direct and indirect damaging W.C.D., and J.C. performed research; A.S., S.A., W.C.D., J.M.S., D.H., J.C., and P.F.L. analyzed effects of solar UV radiation have been isolated from diverse data; and A.S., W.C.D., J.M.S., D.H., J.C., and P.F.L. wrote the paper. free-living algae and marine invertebrate-algal symbioses, in- The authors declare no conflict of interest. cluding zooxanthellate cnidarians. MAAs are putatively synthe- This article is a PNAS Direct Submission. sized via a branch of the shikimic acid pathway at 3-dehydro- ʈTo whom correspondence should be addressed. E-mail: [email protected]. quinate (5–8). Therefore, the expectation is that MAAs in This article contains supporting information online at www.pnas.org/cgi/content/full/ phototrophic symbioses are produced by the microbial partner, 0707388105/DC1. or obtained from the diet, because metazoans reputedly lack the © 2008 by The National Academy of Sciences of the USA www.pnas.org͞cgi͞doi͞10.1073͞pnas.0707388105 PNAS ͉ February 19, 2008 ͉ vol. 105 ͉ no. 7 ͉ 2533–2537 43 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

which provided us with the opportunity to mine the genome (21) of a nominally symbiont-free cnidarian for evidence of genes encoding enzymes of the shikimic acid pathway. Results and Discussion The anabolic shikimic acid pathway has seven steps [supporting information (SI) Scheme 1], which may be catalyzed by seven different polypeptides or by fewer multifunctional polypeptides (22). The enzymes for five of the biosynthetic steps are homol- ogous in all organisms that possess the pathway. For two of the steps, there are two different enzymes known for each, and every organism expressing the pathway has a homologue to one of these enzymes. Furthermore, there are two additional consid- erations in detecting genes encoding shikimic acid pathway Fig. 1. Phylogenetic tree showing the relationship of the predicted protein enzymes in N. vectensis:(i) the evolutionary origin of the genes sequence of the N. vectensis aroA-like gene to the predicted murA protein would be uncertain so that the sequences might have diverged sequences of the seven best hits in a BLAST analysis and to those in E. coli and considerably from any comparison sequences used and (ii) the Tenacibaculum. Distances were calculated from a CLUSTAL W alignment using genomic sequence may contain introns. the Jones-Taylor-Thornton matrix, and the tree was constructed by using the To obtain the greatest sensitivity of interrogation, the neighbor-joining algorithm in programs of the PHYLIP package (version 3.63). HMMER suite of programs (23) was used to search for con- The distance is proportion of amino acid substitutions. sensus protein sequences by using hidden Markov model pro- files. This method provides greater weight to evolutionary used for a BLAST search (SI Dataset 3), the closest fit was with conserved residues, and local profiles reveal the protein frag- the dinoflagellate Oxyrrhis marina (66% amino acid sequence ments in coding exons. The genome sequence of N. vectensis was identity). In this dinoflagellate, the aroB enzyme (3- translated in all six reading frames and searched by using nine dehydroquinate synthase) is present in the chloroplast and is profiles covering all seven enzymes of the shikimic acid pathway fused to an O-methyltransferase (25). When the complete fusion obtained from the Pfam database (24). Two alignments (‘‘hits’’) protein sequence from O. marina was used in a BLAST search were found in large scaffolds with HMMER using the aroA and against the translated DNA of N. vectensis, it was evident that a aroB profiles (SI Dataset 1). The aroA hit occurred in scaffold࿝33 fusion protein gene was also present in N. vectensis (SI Dataset (1.4 Mbp). When the predicted protein sequence was used for a 4). This gene contains five introns. When the aroB segment of the BLAST search, it aligned with the murA gene product of a gene was used to construct a phylogenetic tree with the closest variety of bacteria with Ϸ40% amino acid identity. This bacterial BLAST hits (Fig. 2), the N. vectensis sequence emerged as being gene encodes UDP-N-acetylglucosamine 1-carboxyvinyltrans- closest to those of two dinoflagellates (O. marina and Hetero- ferase (SI Dataset 2), an enzyme related to aroA (3- phosphoshikimate 1-carboxyvinyltransferase), whereas the MurA enzyme is involved in the biosynthesis of the peptidogly- can cell wall. This finding initially suggested that the aligned sequence might originate from bacterial contamination. How- ever, close examination of the HMMER results showed that the predicted protein lacked Ϸ20 conserved amino acids at the C terminal, and that the missing amino acid sequence was located Ϸ1 kb downstream in the scaffold. Visual comparison of the genomic sequence with consensus sequences for vertebrate introns revealed plausible splice sites (AGGTRA and AGG, respectively) that would produce mRNA encoding a full-length murA homologue having a close fit to the search profile. The presence of introns thus eliminates the question of bacterial contaminants or symbionts as the proximal source of this gene. The 1.4-Mbp scaffold containing the aroA-like homolog was translated in all six reading frames and scanned by using HMMER with the entire Pfam library. This process showed the presence of a variety of typical eukaryotic domains including reverse transcriptase, EGF, calcium binding EGF domain, de- fensin-like peptide, actin, and the fork-head domain, again supporting the idea that the aroA-like homolog is contained in the N. vectensis genome itself. The sequence of the predicted protein was used to construct a phylogenetic tree to compare with the closest bacterial sequences found in the BLAST search and with Tenacibaculum sp. MED152 and W3110 (Fig. 1). The aroA-like sequence of N. vectensis did not cluster with homologs from any group of bacteria tested, but showed a sequence divergence from bacterial sequences com- Fig. 2. Phylogenetic tree showing the relationship of the deduced protein parable with those of murA genes between different bacterial sequence of the aroB part of the AroB-O-methyltransferase protein of N. vectensis to homologous dinoflagellate proteins. Sequences were aligned groups. Whether this gene in N. vectensis directs biosynthesis with CLUSTALW and the tree was constructed by using the neighbor-joining of peptidoglycan or shikimate pathway intermediates is yet algorithm with distances derived from the Jones-Taylor-Thornton model (us- unknown. ing PHYLIP version 3.63). The tree was rooted by using Anabaena variabilis as The second alignment, related to aroB, was present on scaf- an out group. The distances are the proportion of amino acid substitutions, fold࿝85 (0.8 Mbp). When the predicted protein sequence was and the bootstrap values based on 100 samples are shown.

2534 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0707388105 Starcevic et al. 44 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Fig. 3. Phylogenetic tree of PCNA protein sequences. The sequence of the PCNA protein of O. marina was used for a BLAST search against the translated genomic sequences of N. vectensis. The BLAST alignments were used to assemble the protein sequence from N. vectensis. The sequences from the two species were used for BLAST searches of GenBank, and a selection of the best hits for each species was used to construct a phylogenetic tree by using the neighbor-joining algorithm in programs of the PHYLIP package (version 3.63). The distance is proportion of nucleotide substitutions.

Fig. 4. Phylogenetic tree showing the relationship of the 16S rRNA gene capsa triquetra) that each possess the complete fusion gene. sequence found in the N. vectensis genome sequence (720-bp fragment in Again, this gene could be involved in the synthesis of precursors entry c429301624.Contig1 of StellaBase, http://evodevo.bu.edu/stellabase; SI leading to shikimate pathway-derived secondary metabolites, Dataset 8a) to the sequences of the closest type strains in Ribosomal Data Base most notably 3-dehydroquinate, the putative intermediate Project II (release 9.52; http://rdp.cme.msu.edu). Distances were calculated from a CLUSTAL W alignment using the F84 model, and the tree was con- branchpoint to MAA biosynthesis (5). structed as in Fig. 3. GENETICS Because endosymbiotic dinoflagellates are often associated with cnidarians, the possibility had to be considered that there was an undetected dinoflagellate contaminating the N. vectensis base to reveal related sequences. In all four cases, the best sequence. The predicted protein sequences derived from the matches were to the genes of the shikimic acid pathway in neighboring genes on either side of the aroB-like homolog Flavobacteria, having Ϸ70% amino acid identity (SI Dataset 7). [encoding a RuvB-like protein and a probable malate synthase In most cases Tenacibaculum sp. MED152, whose genome is (MS), respectively] were used for BLAST searches. The closest being sequenced (www.moore.org/marine-micro.aspx), was the alignments were with various vertebrates and with sequences best match, although a strict similarity may be influenced by from the sea urchin Strongylocentrotus purpuratus, which makes database bias for this bacterium. A fifth gene in N. vectensis it unlikely that these shikimate pathway genes in the host corresponded to the aroF-H genes of E. coli, which encode metazoan’s genome are from contamination by the genome of an isoenzymes for 3-deoxy-D-arabinoheptulosonate-7-phosphate associated dinoflagellate (SI Dataset 5). In addition, three ␤ synthase (DAHPS). However, BLAST searches showed the best putative protein sequences [ -tubulin, heat shock protein 90 hits (90% amino acid identity) to be the kdsA genes of Flavobac- (HSP-90), and proliferating cell nuclear antigen (PCNA)] from teria; these encode other isoenzymes of the DAHPS family that O. marina were used for BLAST searches against N. vectensis. are involved in lipopolysaccharide synthesis. The best hits were used to construct a phylogenetic tree, and in The high similarity of the N. vectensis gene sequences to those no case were the N. vectensis and O. marina sequences closely of the bacterial shikimate pathway could be explained by either related (Fig. 3). It must be stressed, however, that additional a recent HGT event or bacterial DNA contamination in the N. evidence is necessary to determine the assumed function of these vectensis genome sequences. The codon usage was similar to genes and for proof of their acquisition by horizontal gene transfer (HGT) in N. vectensis, particularly because cnidarians Tenacibaculum rather than N. vectensis. Two sequences were reputedly have conserved genes that they inherited from non- identified that appeared to be significant fragments of bacterial metazoan ancestors (26). Although the importance of HGT in 16S rRNA genes. One 16S rRNA sequence (985 bp; SI Dataset eukaryotic evolution remains controversial, there is independent 8a) showed closest similarity to Pseudomonas sequences. How- evidence for the occurrence of another HGT event in N. ever, as it did not belong to a scaffold containing other bacterial vectensis. Comparative genomic examination of glyoxylate cycle sequences and there were no other Pseudomonas-like genomic enzymes has revealed the likely transfer of a bifunctional iso- sequences detected, it is likely that it is derived from a sequenc- citrate lyase (ICL) and a MS, encoded by a fused ICL-MS gene ing contaminant. Because the original shotgun sequencing data from a bacterial precursor, to the N. vectensis genome (27). Our were not available to us, we could not analyze the N. vectensis findings are similar to those of others reporting evidence of gene genome by using a recently released version of the Glimmer gene transfer to freshwater cnidarian (Hydra) species from multiple annotation tool (http://cbcb.umd.edu/software/glimmer; ref. 30), ancestral eukaryotic partners (18, 28). which would have been a useful way to quantify the percentage Our genomic mining of N. vectensis revealed another surprise of the hologenome encoded on small scaffolds and likely, beyond the transfer of genes from a bacterium and a dinoflagel- therefore, to be from living bacteria. late to the cnidarian’s genome. We found seven good sequence The other 16S rRNA sequence belonged to a scaffold, which alignments corresponding to five potential genes of the shikimic also contained 23S rRNA sequences in an arrangement typical acid pathway. Among these were four very strong alignments of rRNA operons (720 bp; SI Dataset 8b), and a phylogenetic corresponding to the genes aroA, aroB, aroC, and aroE of E. coli tree of the 16S rRNA portion (Fig. 4) showed that it came from (SI Dataset 6). The predicted protein sequences of these genes a flavobacterium, but it could not be assigned to a known genus. were used in BLAST search queries (29) against the National Phylogenetic trees were also constructed for the aroA, aroB, Center for Biotechnology Information (NCBI) GenBank data- aroC, and aroE sequences, with similar results. A further con-

Starcevic et al. PNAS ͉ February 19, 2008 ͉ vol. 105 ͉ no. 7 ͉ 2535 45 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

sideration was that much of the genome of N. vectensis was the genome sequence of the bacterial endosymbiont Carsonella organized into large scaffolds, whereas these 16S rRNA frag- ruddii found in aphids was made public (37, 38). Comparison ments were present in small scaffolds from which short contigs of this genome sequence with that of another bacterial endo- were sequenced, so that only incomplete gene sequences were symbiont of aphids, Buchnera aphidicola, showed that both revealed. This result gave the first indication that these 16S genomes had undergone considerable deletion, including loss rRNA fragments might be from bacterial contamination rather of some genes encoding essential metabolic pathways. One than from genomic DNA of N. vectensis in the strict sense. The such missing pathway leading to the formation of the aromatic Tenacibaculum genome project has identified most of its genes, amino acid in C. ruddii caught our attention. and the 2,679 predicted protein sequences from the genomic According to dogma (10), precursors for this essential amino annotation were used for a BLAST search against translated N. acid should be synthesized via the shikimic acid pathway in the vectensis DNA. When a stringent expected value of Ͻ10Ϫ30 was commensal bacteria. Again, we searched global sequence used, 509 of the Tenacibaculum sequences (19%) gave positive alignments for genes encoding enzymes of the shikimic acid hits. However, a less stringent cutoff (10Ϫ10) gave 1,563 (58%) pathway in these bacterial genomes. We found one gene hits. In many of these cases, higher expected values were encoding a putative 5-enolpyruvylshikimate-3-phosphate associated with partial sequences, as the hits were in smaller phospholyase in C. ruddii (although whether this gene would scaffolds having small contigs with many bases in the scaffolds transcribe a functional product is debatable owing to the large not being determined. In fact, the aroE and kdsA genes were at number of stop codons in the sequence), and only three (those the ends of contigs so that their sequences were truncated and encoding shikimate 5-dehydrogenase, 5-enolpyruvylshikimate- lacked the last 40 or 25 aa, respectively. Although an accidental 3-phosphate phospholyase, and 5-enolpyruvylshikimate-3- contamination of the original N. vectensis template cannot be phosphate synthase) of the seven genes for the pathway were ruled out, an exciting possibility is that the sequences come from apparent in the B. aphidicola genome (SI Dataset 9). Taken a previously unsuspected flavobacterial associate similar to together with our findings for the putative Tenacibaculum-like Tenacibaculum. symbiont and its host N. vectensis, this evidence strongly suggests There is independent support for our contention that the that the loss of essential metabolic function in the endosymbiont foregoing sequences in the published genome for N. vectensis is an ongoing process of gene transfer and deletion in the may come from bacteria associated with the early develop- evolution of symbioses that could ultimately lead to extinction of mental stages of the anemone. The authors of the reported the symbiont by progressive assimilation of its genetic material Nematostella vectensis genome (20), in their supporting online into the host genome (37, 38). material (Supplement S2 in www.sciencemag.org/cgi/content/ The elucidation of ‘‘shared metabolic adaptations,’’ where full/317/5834/86/DC1), did explicitly state that they prepared the production of essential metabolites involves input by the genomic DNA from larvae to avoid contamination by the partners of a symbiosis (even if one is degenerate), will require commensals or symbionts that have been reported for the further genomic dissection of the unique organization and adults, although they gave no reference for the latter statement molecular functioning of invertebrate-microbial symbioses. regarding such associates. Despite this precaution, there are This is highlighted by our finding that two of the genes for separate findings that DNA isolates from embryos and early enzymes of the shikimic pathway, classically said to be absent planula larvae of this sea anemone contain 16S rRNA se- from ‘‘animals,’’ are encoded in the metazoan host’s genome. quences obtained from PCR amplicons attributed to bacteria, The extent to which such HGT, or the involvement of unsus- including those of the same groups (Flavobacteria and Pseudo- pected bacterial consorts, may account for the apparent monas) that we report here (H. Marlow and M. Q. Martindale, metabolic anomalies in cnidarians described in the Introduc- personal communication). tion, warrants further investigation. Understanding these pro- Bacterial associates of cnidarians have been known for at cesses may additionally provide critical insight into the cause least 30 years (e.g., refs. 31 and 32), and most recently they of metabolic dysfunction evoked by climate change and envi- have been visualized microscopically as epibionts and endo- ronmental stress, particularly in the fragile symbioses of symbionts in two species of freshwater Hydra (33) and as tropical corals and other marine cnidarians. envelope-wrapped aggregates in caverns between ectodermal cells of the nominally nonsymbiotic sea anemone Metridium Materials and Methods senile (34). Such an intimate association with metazoan cells The DNA sequence of the N. vectensis genome was downloaded from Stella- lacking an external physical barrier lends itself to direct Base version 1.0 (http://evodevo.bu.edu/stellabase) and translated into all six host–microbe interactions manifested variously as pathogenic- reading frames by using Transeq (www.ebi.ac.uk/emboss/transeq). For profile ity in corals (35), the development of the immune response in analyses, HMMER version 2.3.2 (http://hmmer.janelia.org) and release 20 of Cnidaria (33), and a close symbiotic integration culminating in the Pfam database (www.sanger.ac.uk/Software/Pfam) were used. Similarity searches used the BLAST service at NCBI (www.ncbi.nlm.nih.gov/BLAST). HGT from bacteria to host cnidarian as demonstrated here. Virtually nothing is known of the biosynthetic or other met- ACKNOWLEDGMENTS. We thank John R. Finnerty and James Sullivan of abolic function of bacteria symbiotic with cnidarian hosts, a StellaBase (21) for useful comments and advice and Madeline van Oppen topic that, like so many others in modern marine microbiology, (Australian Institute of Marine Science) and Anthony Smith (University of warrants investigation. London) for comment and advice in the preparation of this manuscript. Our insights regarding N. vectensis, and the revision of this manuscript, were HGT between bacteria and certain metazoans (ecdysozoans, greatly aided by the generous communication of information by Heather including insects and nematodes) was recently demonstrated Marlow and Mark Q. Martindale (University of Hawaii). This work was by Baldo et al. (36) to be more widespread than suspected. supported by a cooperation grant from the German Academic Exchange They noted that bacterial sequences have been regarded Service and the Ministry of Science, Education and Sports, Republic of Croatia (to D.H. and J.C.), a stipendium from the German Academic Ex- previously as contamination and systematically excluded by change Service (to A.S.), the School of Pharmacy, University of London (S.A. eukaryotic genome sequencing projects, possibly masking the and P.F.L.), the Australian Institute of Marine Science (W.C.D.), and the importance of such transfer in diverse invertebrates. Earlier, University of Maine (J.M.S.).

1. Muscatine L (1990) Coral reefs in Ecosystems of the World, ed Dubinsky Z (Elsevier, 3. Allemand D, Furla P, Benazet-Tambutte´S (1998) Mechanisms of carbon acquisition for Amsterdam), Vol 35, pp 75–87. endosymbiont photosynthesis in Anthozoa. Can J Bot 76:925–941. 2. Cook CB, D’Elia CF (1987) Host feeding and nutrient sufficiency for zooxanthellae in the 4. Furla P, et al. (2005) The symbiotic Cnidarian: A physiological chimera between alga sea anemone Aiptasia pallida. Symbiosis 4:199–212. and animal. Integr Comp Biol 45:595–604.

2536 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0707388105 Starcevic et al. 46 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

5. Shick JM, Dunlap WC (2002) Mycosporine-like amino acids and related Gadusols: 20. Putnam NH, et al. (2007) Sea anemone genome reveals ancestral eumetazoan gene Biosynthesis, acumulation, and UV-protective functions in aquatic organisms. Annu repertoire and genomic organization. Science 317: 86–94. Rev Physiol 64:223–262. 21. Sullivan JC, et al. (2006) StellaBase: The Nematostella vectensis Genomics Database. 6. Favre-Bonvin J, Bernillon J, Salin N, Arpin N (1987) Biosynthesis of mycosporines: Nucleic Acids Res 34:D495–D499. mycosporine glutaminol in Trichothecium roseum. Phytochemistry 29:2509–2514. 22. Richards TA, et al. (2006) Evolutionary origins of the eukaryotic shikimate pathway: 7. Shick JM, Romaine-Lioud S, Ferrier-Page`s C, Gattuso J -P (1999) Ultraviolet-B radiation Gene fusions, horizontal gene transfer, and endosymbiotic replacements. Eukaryot stimulates shikimate pathway-dependent accumulation of mycosporine-like amino Cell 5:1517–1531. acids in the coral Stylophora pistillata despite decreases in its population of symbiotic 23. Eddy SR (1988) Profile hidden Markov models. Bioinformatics 14:755–763. dinoflagellates. Limnol Oceanogr 44:1667–1682. 24. Bateman A, et al. (2002) The Pfam protein families database. Nucleic Acids Res 8. Portwich A, Garcia-Pichel F (2003) Biosynthetic pathway of mycosporines in the cya- 30:276–280. nobacterium Chlorogloeopsis PCC 6912. Phycologia 42:384–392. 25. Waller RF, Slamovits CH, Keeling PJ (2006) Lateral gene transfer of a multigene region 9. Shick JM, Dunlap WC, Pearse JS, Pearse VB (2002) Mycosporine-like amino acid content from cyanobacteria to dinoflagellates resulting in a novel plastid-targeted fusion in four species of sea anemones in the genus Anthopleura reflects phylogenetic but not protein. Protein Mol Biol Evol 23:1437–1443. environmental or symbiotic relationships. Biol Bull 203:315–330. 26. Technau U, et al. (2005) Maintenance of ancestral complexity and nonmetazoan genes 10. Herrmann KM, Weaver LM (1999) The shikimate pathway. Annu Rev Plant Physiol Mol in two basal cnidarians. Trends Genet 21:633–639. Biol 50:473–503. 27. Kondrashov FA, Koonin EV, Morgunov IG, Finogenova TV, Kondrashova MN (2006) 11. Fitzgerald LM, Szmant AM (1997) Biosynthesis of ‘‘essential’’ amino acids by sclerac- Evolution of glyoxylate cycle enzymes in Metazoa: Evidence of multiple horizontal tinian corals. Biochem J 322:213–221. transfer events and pseudogene formation. Biol Direct 1:31. 12. Wang J, Douglas AE (1998) Nitrogen recycling or nitrogen conservation in an alga- 28. Habetha M, Bosch TC (2005) Symbiotic Hydra express a plant-like peroxidase gene invertebrate symbiosis? J Exp Biol 201:2445–2453. during oogenesis. J Exp Biol 208:2157–2165. 13. Stochaj WR, Dunlap WC, Shick JM (1994) Two new UV-absorbing mycosporine-like 29. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search amino acids from the sea anemone Anthopleura elegantissima and the effects of tool. J Mol Biol 215:403–410. zooxanthellae and spectral irradiance on chemical composition and content. Mar Biol 30. Delcher AL, Bratke KS, Powers EC, Salzberg SL (2007) Identifying bacterial genes and 118:149–156. endosymbiont DNA with Glimmer. Bioinformatics 23:673–697. 14. Banaszak AT, Trench RK (2001) Ultraviolet sunscreens in dinoflagellates. Protist 31. Margulis L, Thorington G, Berger B, Stolz J (1978) Endosymbiotic bacteria associated 152:93–101. with the intracellular green algae of Hydra viridis. Curr Microbiol 1:229–232. 15. Banaszak AT, LaJeunesse TC, Trench RK (2000) Synthesis of MAA by symbiotic 32. Palincsar EE, Jones WR, Palincsar JS, Glogowski MA, Mastro JL (1989) Bacterial aggre- dinoflagellates in culture. J Exp Mar Biol Ecol 249:219–233. gates within the epidermis of the sea anemone Aiptasia pallida. Biol Bull 177:130–140. 16. Banaszak AT, Santos MGB, LaJeunesse TC, Lesser MP (2006) The distribution of 33. Fraune S, Bosch TC (2007) Long-term maintenance of species-specific bacterial micro- mycosporine-like amino acids (MAAs) and the phylogenetic identity of symbiotic biota in the basal metazoan Hydra. Proc Natl Acad Sci USA 104:13146–13151. dinoflagellates in cnidarian hosts from the Mexican Caribbean. J Exp Mar Biol Ecol 34. Schuett C, Doepke H, Grathoff A, Gedde M (2007) Bacterial aggregates in the tentacles 337:131–146. of the sea anemone Metridium senile. Helgol Mar Res 61:211–216. 17. De la Cruz R, Davies J (2000) Horizontal gene transfer and the origin of species: Lessons 35. Rosenberg E, Loya Y, eds (2004) Coral Health and Disease (Springer, Berlin).

from bacteria. Trends Microbiol 8:128–133. 36. Baldo L, et al. (2007) Multilocus sequence typing system for the endosymbiont Wol- GENETICS 18. Steele RE, Hampson SE, Stover NA, Kibler DF, Bode HR (2004) Probable horizontal bachia pipientis. Appl Environ Microbiol 72:7098–7110. transfer of a gene between a protist and a cnidarian. Curr Biol 14:R298–R299. 37. Andersson SGE (2006) Genetics: The bacterial world gets smaller. Science 314:259–260. 19. Darling JA, et al. (2005) Rising starlet: The starlet sea anemone, Nematostella vectensis. 38. Pe´rez-Brocal V, et al. (2006) A small microbial genome: The end of a long symbiotic BioEssays 27:211–221. relationship? Science 314:312–313.

Starcevic et al. PNAS ͉ February 19, 2008 ͉ vol. 105 ͉ no. 7 ͉ 2537 47 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

3 DISCUSSION

As we anticipated in the work objectives, to tackle DNA sequences of marine organisms, we had to use more sensitive existing bioinformatic tools as well as build our own. We have chosen HMM profiles as a major component of our future application because we believe them to be more sensitive to distant homologies. The big advantage of this approach was also the existence of Pfam database, the database of protein families, each represented by multiple sequence alignments and hidden Markov model (HMM) (Finn et al., 2008).

The first proof of principle came with the patellamides. Patellamides are a family of highly conserved thiazole- and oxazoline-containing cyclic octapeptides isolated from the Indo-Pacific seasquirt Lissoclinum patella. These compounds are novel examples of post-translationally modified ribosomal peptides. To make things harder, they seem to be produced by the unculturable Prochloron didemni prokaryote symbiont and not the seasquirt itself (Long et al., 2005). The gene cluster itself showed extremely low sequence similarity to genes of known

function (Schmidt et al., 2005) indicating that the pathway might contain novel enzymes. Using HMMs we were able to confirm the previous annotation of Schmidt et al. (Schmidt et al., 2005) which strengthened our belief that HMMs are a good diagnostic tool even for such unusual pathways. Also, we were able to detect new unknown domains in previously identified peptides.

Another good opportunity was the starlet sea anemone Nematostella vectensis. This basal metazoan was one of the first to have its genome sequenced (Putnam et al., 2007) and we were provided with the opportunity to mine the genome for evidence of genes encoding enzymes of the shikimic acid pathway. In the marine environments these organisms come from, concentrations of both inorganic and organic nutrients are often low. This makes symbiotic relationships favourable, such as photoautotrophic endosymbionts, which provide the animal hosts with organic carbon and essential amino acids and in return receive catabolism waste products and respiratory CO2. In fact it is known that dinoflagellate endosymbionts of marine invertebrates known as ‘‘zooxanthellae’’ are particularly common in association with anthozoan cnidarians such as corals and sea anemones. Interactions between host animal and its endosymbiont go far beyond the cycling of carbon and nutrients and must also include other metabolic adaptations to the marine environment. Our interest was in UV-absorbing

48 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

mycosporine-like amino acids (MAAs) that protect against the direct and indirect damaging effects of solar UV radiation. These compounds have been isolated from various free-living algae and marine invertebrate algal symbioses, including zooxanthellate and cnidarians. Since MAAs are putatively synthesized via a branch of the shikimic acid pathway at 3-dehydroquinate and since animals are believed to lack the shikimic acid pathway, it would be expected for them to be produced by the microbial partner, or obtained from the diet.

Our analysis using HMM profiles covering all seven enzymes involved in the shikimic acid pathway showed the presence of aroA- and aroB-related genes in large scaffolds. Because there was a possibility of DNA contamination from an unsuspected endosymbiotic dinoflagellate, we have analysed neighbouring genes; they were related to various vertebrate and sea urchin genes making contamination highly unlikely. Also, authors of the genome (Putnam et al., 2007) explicitly state that they prepared genomic DNA from from embryos and early planula larvae, in order to reduce the possibility of contamination. When the AroB hit was extracted and used for a BLAST search the closest relation was to an O-methyltransferase – AroB fusion protein gene from the dinoflagellate Oxyrrhis marina. When this fusion gene was compared to the N. vectensis translated DNA we indeed found the presence of a similar fusion. The fusion gene in N. vectensis had five intron-like segments in it which suggested animal origin (Fig. 4A).

Another surprising finding was seven good sequence alignments corresponding to five potential genes of the shikimic acid pathway. Among these were four very strong alignments corresponding to the genes aroA, aroB, aroC, and aroE of E. coli. Comparative study of glyoxylate cycle enzymes revealed the likely transfer of a bifunctional isocitrate lyase (ICL) and a malate synthase (MS), encoded by a fused ICL-MS gene from a bacterial precursor, to the N. vectensis genome (Kondrashov et al., 2006) so we initially suspected that a horizontal transfer of the shikimate pathway had occurred. However, BLAST searches told us that these five new genes corresponded to shikimic acid pathway genes found in Flavobacteria having more than 70% amino acid identity. In most cases Tenacibaculum sp. MED152 was the best match. Although, this new finding could be explained by a very recent horizontal gene transfer event,

49 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

A

B

Fig. 4. The Nematostella vectensis shikimic acid pathway genes identified in contigs 1 (A) and 2 (B). The contig 1 encodes the quinate 5-dehydrogenase gene (aroM) with the two frameshifts probably split by four introns. The contig 2 showed the DAHP synthetase I family gene (aroG) with no introns and frameshifts probably having procaryotic origin.

50 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

bacterial "contamination" appeared to be a more plausible explanation. Codon usage analysis and the fact that these hits appeared in rather small scaffolds supported this interpretation. The bacterial "contamination" may well be a symbiont of N. vectensis.

After these initial successes in applying HMMs we concluded that they represent a good alternative to similarity based approaches and included them in our ClustScan tool design. Modular type I polyketide synthases seemed to be the best candidates for implementing an initial top-down annotation system as they were well understood and had a clear hierarchical architecture. The literature made it possible to make good predictions based on critical residues regarding substrate specificity, domain activity and α carbon hydroxyl group stereochemistry. However, a major problem was the epimerisation step some of the methyl groups found on β- carbons underwent. We needed efficient prediction for methyl group stereochemistry to be able to create a tool which can not only identify gene clusters but also predict linear and cyclic structures of their aglycon products. It was generally believed that the stereochemistry was controlled by the KS domains, but it was not possible to find amino acid differences correlated with stereochemistry. Based on what we observed within known polyketides, we proposed that the KR domain has two rather than just one functionality. Its primary role would still be β-keto reduction, but it would also perform epimerization reactions on the methyl group where needed. Profile analysis indeed showed putative epimerization domain overlapping with known KR domains. Using homology modelling we were able to make a model of KR domain 3D structure since at the time there was no crystal structure available (Keatinge-Clay and Stroud, 2006). We identified conserved His residue critical for epimerization reaction besides previously known N, S, Y and K involved in reduction, which we included in the proposed epimerization mechanism.

As a major goal of our efforts, we have created the ClustScan program package for top-down rapid semi-automatic annotation of modular biosynthetic gene clusters namely: polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), siderophore (NIS) synthases and hybrid clusters based on HMM profiles. As model gene cluster we have chosen PKS but we have had encouraging results with NRPS and various hybrid clusters. In principle, any cluster for which profiles of domains can be built can be efficiently annotated by this tool. Predictions for linear and in the case of macrolides cyclic aglycon structures are so far only possible for modular

51 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

PKS gene clusters. We plan to expand this functionality to the rest of modular biosynthetic enzymes, NRPSs being the next logical step.

3.1 The role of ketoreductase domain in determining methyl stereochemistry

Erythromycin aglycone 6-deoxyerythronolide B is one of the most studied polyketides, especially regarding stereochemistry. Six modules housed within three polypeptides catalyse its condensation from propionyl-CoA starter and six (2S)–methylmalonyl-CoA extender units. Modules 1, 2, 5 and 6 contain an active KR domain; module 4 contains a complete set of reductive domains: namely KR, DH and ER; while module 3 has no reductive functionality although our profile analysis shows the presence of an inactive KR domain. Polyketide biosynthesis proceeds by a Claisen condensation with inversion of configuration to give a 2R methyl centre. However, in module 1, this decarboxylative condensation involves cleavage of the CH bond adjacent to the methyl group at C2 of the extender unit to produce a 2S methyl centre. This implies that there must be an epimerisation step taking place (Weissman et al., 1997). At the start of the thesis work it was generally assumed that the stereochemistry of the methyl group would be controlled by the KS domain that carries out the decarboxylative condensation.

Based on multiple alignments of KR domains, Caffrey already identified residues determining KR type and divided them into two groups regarding OH group stereochemistry (Caffrey, 2003). These groups, named A and B, correspond to S or R hydroxyl group configuration. Using HMM profiles of all epimerization domains present in the Pfam database (Finn et al., 2008), I ran hmmsearch (Eddy, 1998) analysis on PKS genes. The NAD-dependent epimerase/ /dehydratase family gave strong enough hits to be taken into consideration. Surprisingly, the putative epimerization motif was not in KS domains, but, was present in KR domains. A 3D model of the KR domain was constructed using homology modelling before Keatinge-Clay and Stroud's KR crystal structure (Keatinge-Clay and Stroud, 2006) was published. It was noticed that there is a conserved histidine residue in the active site cleft, which had not previously been assigned a functional rôle. Histidine is the most abundant amino acid in

52 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

protein active sites and binding sites and the ease with which protons can be transferred on and off of histidines makes them ideal for charge relay systems such as those found within catalytic triads (Betts and Russell, 2003). All these facts made us believe it was also involved in KR mediated epimerization (Milne et al., 2006).

The 3-D model of the KR domain did not suggest any other residues or active site clefts for epimerization other than the one already used for reduction so we proposed a bifunctional active site. However, it is known that some KR’s perform epimerization and some do not, so we tried to explain this behaviour by the fact that the primary function of every KR is reduction of a β-keto group. Epimerization in our model will only be observed if the reduction of the keto group with the neighbouring methyl group in the R configuration is not possible because of steric clashes. This rather simple hypothesis can explain why, for example, we observe 2S stereochemisty with a methylmalonyl substrate processed by module 3 of erythromycin DEBS1 which has a KR domain that is inactive in reduction, but still performs epimerisation, or why isolated KR domains behave differently (Fig. 5A and B) (Siskos et al., 2005; Starcevic et al., 2007). Later work comparing crystal structures of KR domains form modules that have different stereochemical outcomes (Keatinge-Clay, 2007) strongly supported the idea that the KR domain determines the stereochemistry of the methyl group in addition to that of the hydroxyl group.

3.2 Predicting linear products of modular polyketide gene clusters

Polyketides and non-ribosomally synthesised peptides are extremely important chemical substances for both pharmaceutical and agro-industry. Their biosynthetic pathways consist of simple building blocks being assembled into complex chemical structures by means of special enzyme complexes. Type I PKS are complex enzymes with multimodular organisation. Within these multifunctional enzymes, each module contributes one extending building block to the growing polyketide chain. Bioinformatic analyses revealed that this modular organisation is

53 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

A

B

SMILES: CC[C@@H](O)[C@@H](C)[C@H](O)[C@H](C)C(=O)C(C) C[C@@H](C)[C@H](O)[C@@H](C)[C@H](O)C(C)C(S)=O

54 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Fig. 5. Functional prediction of the DNA sequence (AY771999) of the erythromycin gene- cluster. The three genes (in red), six modules (underlined in red) and 29 catalytically active domains (in blue) of the erythromycin gene-cluster are shown. In particular, the inactive KR in module 3 responsible for the hydroxyl-stereochemistry: S stereochemistry was predicted (A). The isomeric SMILES as well as the 2-D structure of the predicted linear chain is also shown (B top and right). The cyclization function predicts a ring structure that can be displayed as the 3-D structure (B bottom and right).

transferred directly from DNA. In other words, the DNA sequences of type I polyketide gene clusters are also organised into modules (Weissman and Leadlay, 2005; Hranueli et al., 2005).

We wanted to use this knowledge to construct a software tool for prediction of linear and in some cases cyclic aglycone structures based on the DNA sequence alone. To achieve this, we had to overcome several obstacles. We had to ensure a more detailed and uniform annotation than had been previously achieved. This means having very sensitive methods for domain recognition and for the prediction of activity, specificity and stereochemistry. Another essential component was the graphical GUI (Graphical User Interface) allowing intuitive data management, so we could recognise clustered genes, assemble domains into modules etc. Even if all this pre-requisites are satisfied and one is able to scan entire genome or metagenome, identify a set of clustered genes with PKS domains inside which can be organised into modules, what about biosynthetic order? In cases where the biosynthetic order could not be deduced from the presence of loading domains and thioesterase domains, it was assumed that the natural order of the genes on their corresponding DNA strands agrees with biosynthetic order. This hypothesis was supported in most cases by analysis of known gene clusters. Clusters having all of their genes on one DNA strand, agreed with this rule, others having genes on both strands proved to be more difficult but not unsolvable. If a loading module or a TE domain can be identified, ClustScan’s GUI allows genes to be manually re-arranged so that all one has to do is to pull the gene containing the loading module to the beginning of cluster and/or gene containing TE domain to the end. Genes in the middle arrange themselves according to their respective DNA position again in concordance with polycistronic mRNA hypothesis (Starcevic et al., 2008b).

55 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

To be able to produce linear or cyclic chemical structures, we needed to establish connections between building blocks incorporated by a particular module and the complete gene cluster annotation. For this purpose we have built a PKS building block chemical library containing all known building blocks in SMILES format (Weininger, 1988). This library is a part of separate computer program we have written in Java. This program collects the predictions for AT domain specificity and KR domain linked –OH and –Me stereochemistry along with module composition summarized as reduction potential (e.g. reductive domains KR, DH and ER presence and activity respectively) produced by ClustScan's multiple alignment-based prediction programs. All this data is summarized in the form of an XML (EXtensible Markup Language) document representing the complete gene cluster which is readable by our chemical library program. The program then parses the information stored in the XML gene cluster representation and correctly chooses the building block each module incorporates. This is by no means trivial since our goal is to get the correct stereochemistry and reduction and that increases the number of potential choices. For example, for a malonate building block we can have up to eight choices depending on stereochemistry reduction degree (defined by module composition, e.g. number of reduction domains present and active), to choose from. For more complex building blocks, the number of choices is even greater. Also, the choice of building block is made by the corresponding module's AT domain, whereas the reduction of the keto group and fixation of – OH/-Me stereochemistry is determined by the subsequent one. Finally, ClustScan is able to present one with linear and in case of macrolides, cyclic agylcone structure which can be viewed and/or copied in SMILES format (Weininger, 1988), for editing in some other chemistry program of choice.

56 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

3.3 Microbial origins of shikimic acid pathway enzymes encoded in the genome of Nematostella vectensis

The shikimic acid pathway has seven steps catalyzed by seven different enzyme activities (Knaggs, 2003). The enzymes for five of the seven biosynthetic steps are homologous in all organisms possessing the pathway. For two of the steps there are two different enzymes known for each and every organism with the pathway present has a homologue to one of these. One of the goals of this part of doctoral dissertation was to probe an animal N. vectensis for presence of this pathway. If indeed present, the shikimic pathway genes might be difficult to detect, because the sequences would probably diverge considerably from any of the comparison sequences we might choose to use. Also we would probably have to deal with introns which tend to make sequence comparison more difficult since they break coding sequences into smaller parts and thus make weaker homologies more difficult to detect. To overcome these problems we have chosen to use HMM profiles and the HMMER (Eddy, 1998) suite of programs. The first step consisted of collecting Pfam (Finn et al., 2008) profiles to cover all seven steps of this biosynthetic pathway; this required nine profiles. Then the genome of N. vectensis was translated in all 6 frames, in order to be searched with profiles.

The results were better than expected since we picked up clear aroA and aroB signals. Cross- referenced with GenBank the aroA hit initially suggested a bacterial origin, but its presence in the large scaffold_33 and the fact that we identified a putative intron near the end of the sequence with plausible AGGTRA and AGG splice sites made a eukaryotic origin clear. Closer examination of scaffold_33 with the entire Pfam database revealed other dominantly eukaryotic domains neighbouring the aroA homolog again supporting idea that it belongs to the N. vectensis genome. Phylogenetic tree constructed out of N. vectensis AroA and sequences obtained fom GenBank via BLAST [mostly from Tenacibaculum sp. MED152 (Gonzalez et al. 2008) and Escherichia coli W3110 (Blattner et al. 1997)] showed that N. vectensis sequence didn't cluster with bacterial sequences. The second alignment, the aroB hit from scaffold_85 gave us interesting BLAST results. It proved to be most closely related to the dinoflagellate O. marina. The O. marina aroB gene is fused to an O-methyltransferase gene (Waller et al., 2006).

57 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Surprisingly, when we extracted the fused O. marina protein and searched it against translated scaffold_85 we identified an aroB-methylase fusion in N. vectensis also. We found five introns in this gene. Phylogenetic analysis with best BLAST hits indicated that two dinoflagellates O. marina and Heterocapsa triquetra both had fused proteins related to N. vectensis (Starcevic et al., 2008a).

Another surprising finding was an additional seven good sequence alignments corresponding to five potential shikimic acid pathway genes. In all cases, the BLAST search gave best matches to Flavobacteria related shikimic acid pathway genes. Tenacibaculum sp. MED152 was the best hit in most cases. Codon usage analysis and the fact that all these hits appeared in small scaffolds strongly suggested a bacterial origin of these sequences. We have concentrated on Tenacibaculum since it gave good matches and the genome was already sequenced. When we compared Tenacibaculum predicted proteins against N. vectensis translated scaffolds, using a very stringent E-value threshold of 10-30, we found out that almost 19% of Tenacibaculum sequences showed good alignments. Under less stringent condition of E-value threshold 10-10 which is still significant, we got 58% of matching sequences, i.e. roughly half of all CDSs. However, authors of the genome sequence (Putnam et al., 2007) stated that they used only carefully purified embryos and early planula larvae to extract DNA. It, therefore, seems likely that the sequences are derived from a symbiotic bacterium, possibly an endosymbiont. Laboratory experiments are being carried out to try to detect the postulated endosymbiont. The aroA and aroB homologues present in large scaffolds seem to be derived from horizontal gene transfer into N. vectensis probably from a bacterial and a dinoflagellate source respectively (Starcevic et al., 2008a).

3.4 Carsonella rudii and Buchnera aphidicola – another case of horizontal gene transfer

To find further evidence for our horizontal gene transfer (HGT) endosymbiont hypothesis, we investigated genome sequence of the bacterial endosymbiont C. ruddii (Andersson, 2006; Pérez-Brocal et al. 2006) which is found in aphids. When we compared it with yet another

58 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

endosymbiont of aphids, B. aphidicola, we have seen that both genomes have considerable parts missing. One missing essential pathway caught our attention. It was the tryptophan biosynthesis pathway. Precursors for aromatic amino acids are synthesized by the shikimic acid pathway (Herrmann and Weaver, 1999). When we searched for shikimic acid pathway genes in these genomes, we found only a part of the genes needed for this biosynthetic pathway. Moreover, for some of the alignments we got it was questionable if they represent functional genes at all (Starcevic et al., 2008a).

Comparing these results with what we observed in N. vectensis and its proposed Tenacibaculum-like endosymbiont, we suspect that loss of essential in symbiont could be explained with gradual gene transfer and deletion of endosymbiont's genetic material. Given enough time, this process could eventually lead to extinction of the symbiont by progressive host assimilation (Andersson, 2006; Pérez-Brocal et al., 2006).

59 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

4 OUTLOOK FOR FUTURE WORK

The work presented in this thesis showed the utility of a top-down approach to annotation in special cases. There were three main technical components contributing to the success of the work: 1. Use of HMM-profiles with translated DNA sequences. This gave sensitive detection of protein families and was resistant to problems caused by introns and sequencing errors. It also allowed accurate assignment of critical residues for predicting specificity of enzymes. 2. The use of XML to control knowledge flow in programs. This resulted in a rigorous definition of information and made it easy to communicate between different programs. 3. The development of an intuitive graphical user interface to give a simple overview defined in biological terms as well as allowing access to details of the analyses when desired. The interface also makes it easy for users to override default predictions and incorporate extra knowledge.

The principles that were implemented for the special cases can be expanded to allow generalised annotation. This requires particular care in developing the XML definitions, which ultimately control the top-down structure. The rapid progress in DNA sequencing technology, which will result in an explosion in the number of new genomic sequences, needs such a tool to detect the interesting features in the mass of data. This will require generic genome annotation tool that would be developed based on ClustScan’s architecture. This new software should be able to perform on eukaryotic genomes in a similar way ClustScan performs on prokaryotic. This new system could also be developed further to allow annotation of individual human genomes. For this purpose, it would also be necessary to find effective methods of determining haplotypes that are linked to quantitative trait loci (QTL).

60 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

5 ABSTRACT

Modular biosynthetic clusters represent an extremely rich source of biologically active compounds that find wide-ranging applications. Their biosynthetic pathways consist of successive polymerization steps where simple building blocks get incorporated into a growing chain. This is made possible by fascinating enzymes operating in assembly line fashion. The two major classes are polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS). They are organized in modules, each module being responsible for the introduction of one additional building block. The vast majority of useful secondary metabolites (antibiotics being the most important) have been detected by screening of „wild“ isolates obtained from soil and other habitats. Recent studies revealed a plethora of uncultivable microorganisms in both soil and marine environments. Such a metagenomic approach is showing great perspective in isolation of novel biocatalysts from the uncultured microbiota. A new software tool (ClustScan) for rapid genome and metagenome gene clusters scanning and annotation was developed. An important aim of this work was the prediction of the chemical structures of products and, in particular their stereochemistry. By analysing known cluster sequences, it was found that the stereochemistry of the methyl groups was determined by the ketoreductase domains and not the ketosynthase domains as previously believed. ClustScan proved its utility by rapidly annotating PKS/NRPS gene clusters in genomes of Sacc. erythraea and S. scabies. It was used in annotation of PKS clusters of the slime mould Dictyostelium discoideum and the analysis of a hybrid PKS-NRPS gene-cluster from the J. Craig Venter Institute Global Ocean Sampling (GOS) Expedition metagenomic dataset. The genome sequence of the starlet sea anemone N. vectensis was searched for shikimic acid pathway genes, which are not known to be present in animals. Surprisingly, two genes were detected that had probably been acquired by horizontal gene transfer (HGT) from dinoflagellates and bacteria respectively. This analysis also detected bacterial sequences that might belong to an unidentified endosymbiont. We also studied two known endosymbionts C. ruddii and B. aphidicola both found in aphids and showed that there was considerable gene loss compared to free-living bacteria.

61 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

6 REFERENCES

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

Andersson, S.G.E. (2006) Genetics: The bacterial world gets smaller. Science 314, 259-260.

Bentley, S.D., Chater, K.F., Cerdeno-Tarraga, A.M., Challis, G.L., Thomson, N.R., James, K.D., Harris, D.E., Quail, M.A., Kieser, H., Harper, D., et al. (2002) Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417, 141-147.

Caffrey, P. (2003) Conserved amino acid residues correlating with ketoreductase stereospecificity in modular polyketide synthases. ChemBioChem 4, 654-657.

Betts, M.J. and Russell, R.B. (2003) Amino acid properties and consequences of substitutions. Bioinformatics for Geneticists, p.p. 1-350. Barnes, M.R. and I.C. Gray (eds.), John Wiley & Sons, New York.

Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado- Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. (1997) The complete genome sequence of Escherichia coli K-12. Science 277, 1453-1474.

Castonguay, R., He, W., Chen, A.Y., Khosla, C. and Cane, D.E. (2007) Stereospecificity of ketoreductase domains of the 6-deoxyerythronolide B synthase. J. Am. Chem. Soc. 129, 13758-13769.

Challis, G.L. (2005) A widely distributed bacterial pathway for siderophore biosynthesis independent of nonribosomal peptide synthetases. ChemBioChem 6, 601-611.

Donadio, S., Monciardini, P. and Sosio, M. (2007) Polyketide synthases and nonribosomal peptide synthetases: the emerging view from bacterial genomics. Nat. Prod. Rep. 24, 1073- 109.

Dunlap, W.C., Jaspars, M., Hranueli, D., Battershill, C.N., Peric-Concha, N., Zucko, J., Wright, S.H. and Long, P.F. (2006) New methods for medicinal chemistry - Universal gene cloning and expression systems for production of marine bioactive metabolites. Curr. Med. Chem. 13, 697-710.

Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755-763.

Finking, R. and Marahiel, M.A. (2004) Biosynthesis of non-ribosomal peptides. Ann. Rev. Microbiol. 58, 453-488.

62 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, J.S., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L. and Bateman, A. (2008) The Pfam protein families database. Nucleic Acids Res. 36, D281-D288.

Feher, M. and Schmidt, J.M. (2003) Property distributions: Differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci. 43, 218- 227.

Fleming, A. (1929) On the antibacterial action of cultures of a penicillium, with special reference to their use in the isolation of B. influenza. Br. J. Exp. Pathol. 10, 226-236.

Gonzalez, J.M., Fernandez-Gomez, B., Fernandez-Guerra, A., Gomez-Consarnau, L., Sanchez, O., Coll-Llado, M., Del Campo, J., Escudero, L., Rodriguez-Martinez, R., Alonso-Saez, L., et al. (2008) Genome analysis of the proteorhodopsin-containing marine bacterium Polaribacter sp. MED152 (Flavobacteria). Proc. Natl. Acad. Sci. USA 105, 8724-8729.

Herrmann, K.M. and Weaver, L.M. (1999) The shikimate pathway. Annu. Rev. Plant Physiol. Mol. Biol. 50, 473-503.

Hranueli, D., Cullum, J., Basrak, B., Goldstein, P. and Long, P.F. (2005) Plasticity of the Streptomyces genome - evolution and engineering of new antibiotics. Curr. Med. Chem. 12, 1697-1704.

Hranueli, D., Zucko, J., Starcevic, A., Diminic, J., Elbekali, M.. Lisfi, M., Long, P.F. and J. Cullum. Bioinformatic tools for mining large DNA data sets. Int. J. Artif. Intell. M. 2009. (in press)

Ikeda, H., Ishikawa, J., Hanamoto, A., Shinose, M., Kikuchi, H., Shiba, T., Sakaki, Y., Hattori, M. and Omura, S (2003) Complete genome sequence and comparative analysis of the industrial microorganism Streptomyces avermitilis. Nat. Biotechnol. 21, 526-531.

Keatinge-Clay, A.T. (2007) A tylosin ketoreductase reveals how chirality is determined in polyketides. Chem. Biol. 14, 898-908.

Keatinge-Clay, A.T. and Stroud, R.M. (2006) The structure of a ketoreductase determines the organization of the beta-carbon processing enzymes of modular polyketide synthases. Structure 14, 737-748.

Knaggs, A.R. (2003) The biosynthesis of shikimate metabolites. Nat. Prod. Rep. 20, 119-136.

Kondrashov, F.A., Koonin, E.V., Morgunov, I.G., Finogenova, T.V. and Kondrashova, M.N. (2006) Evolution of glyoxylate cycle enzymes in Metazoa: Evidence of multiple horizontal transfer events and pseudogene formation. Biol. Direct. 1, 1-14.

63 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Kwan, D.H., Sun, Y., Schulz, F., Hong, H., Popovic, B., Sim-Stark, J.C., Haydock, S.F. and Leadlay, P.F. (2008) Prediction and manipulation of the stereochemistry of enoylreduction in modular polyketide synthases. Chem Biol. 15, 1231-1240.

Long, P.F., Dunlap, W.C., Battershill, C.N. and Jaspars, M. (2005) Shotgun cloning and heterologous expression of the patellamide gene cluster as a strategy to achieving sustained metabolite production. ChemBioChem 6, 1760-1765.

Milne, B.F., Long, P.F., Starcevic, A., Hranueli, D. and Jaspars, M. (2006) Spontaneity in the patellamide biosynthetic pathway. Org. Biomol. Chem. 4, 631-638.

Ohnishi, Y., Ishikawa, J., Hara, H., Suzuki, H., Ikenoya, M., Ikeda, H., Yamashita, A., Hattori, M. and Horinouchi, S. (2008) Genome sequence of the streptomycin-producing microorganism Streptomyces griseus IFO 13350. J. Bacteriol. 190, 4050-4060.

Oliynyk, M., Samborskyy, M., Lester, JB., Mironenko, T., Scott, N., Dickens, S., Haydock, S.F. and Leadlay, P.F. (2007) Complete genome sequence of the erythromycin-producing bacterium Saccharopolyspora erythraea NRRL23338. Nat. Biotechnol. 25, 447-453.

Pérez-Brocal, V., Gil, R., Ramos, S., Lamelas, A., Postigo, M., Michelena, J.M., Silva, F.J., Moya, A. and Latorre, A. (2006) A small microbial genome: The end of a long symbiotic relationship? Science 314, 312-313.

Putnam, N.H., Srivastava, M., Hellsten, U., Dirks, B., Chapman, J., Salamov, A., Terry, A., Shapiro, H., Lindquist, E., Kapitonov, V.V., et al. (2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86-94.

Rausch, C., Weber, T., Kohlbacher, O., Wohlleben, W. and Huson, D.H. (2005) Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res. 33, 5799-5808.

Rusch, D.B., Halpern, A.L., Sutton, G., Heidelberg, K.B., Williamson, S., Yooseph, S., Wu, D., Eisen, J.A., Hoffman, J.M., Remington, K., et al. (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 5, e77.

Sattely, E.S., Fischbach, M.A., Walsh, C.T. (2008) Total biosynthesis: in vitro reconstitution of polyketide and nonribosomal peptide pathways. Nat Prod Rep, 25, 757-93.

Schmidt, E.W., Nelson, J.T., Rasko, D.A., Sudek, S., Eisen, J.A., Haygood, M.G. and Ravel, J. (2005) Patellamide A and C biosynthesis by a microcin-like pathway in Prochloron didemni, the cyanobacterial symbiont of Lissoclinum patella. Proc. Natl. Acad. Sci. USA 102, 7315- 7320.

64 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Schuster, S.C. (2008) Next-generation sequencing transforms today's biology. Nat. Methods 5, 16-18.

Shaffer, C. (2007) Next-generation sequencing outpaces expectations. Nat Biotechnol. 25, 149.

Siskos, A.P., Baerga-Ortiz, A., Bali, S., Stein, V., Mamdani, H., Spiteller, D., Popovic, B., Spencer, J.B., Staunton, J., Weissman, K.J. and Leadlay, P.F. (2005). Molecular basis of Celmer's rules: stereochemistry of catalysis by isolated ketoreductase domains from modular polyketide synthases. Chem. Biol., 12, 1145-1153.

Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Starcevic, A., Cullum, J., Jaspars, M., Hranueli D. and Long, P.F. (2007) Predicting the nature and timing of epimerisation on a modular polyketide synthase. ChemBioChem 8, 28-31.

Starcevic, A., Akthar, S., Dunlap, W.C., Shick, J.M., Hranueli, D., Cullum, J. and Long, P.F. (2008a) Enzymes of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins. Proc. Natl. Acad. Sci. USA 105, 2533-2537.

Starcevic, A., Zucko, J., Simunkovic, J., Long. P.F., Cullum, J. and Hranueli, D. (2008b) ClustScan: An integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res., 36, 6882-6892.

Tae, H., Kong, E.B. and Park, K. (2007) ASMPKS: an analysis system for modular polyketide synthases. BMC Bioinformatics 8, 327.

Waller, R.F., Slamovits, C.H. and Keeling, P.J. (2006) Lateral gene transfer of a multigene region from cyanobacteria to dinoflagellates resulting in a novel plastid-targeted fusion protein. Protein Mol. Biol. Evol. 23, 1437-1443.

Watve, M.G., Tickoo, R., Jog, M.M. and Bhole, B.D. (2001) How many antibiotics are produced by the genus Streptomyces? Arch. Microbiol. 176, 386-390.

Weininger, D. (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inform. Comput. Sci. 28, 31-36.

Weissman, K.J. and Leadlay, P.F. (2005) Combinatorial biosynthesis of reduced polyketides. Nat. Rev. Microbiol. 3, 925-936.

Weissman, K.J., Timoney, M., Bycroft, M., Grice, P., Hanefeld, U., Staunton, J. and Leadlay, P.F. (1997) The molecular basis of Celmer's rules: The stereochemistry of the condensation step in chain extension on the erythromycin polyketide synthase. Biochemistry 36, 13849- 13855.

65 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Yadav, G., Gokhale, R.S. and Mohanty, D. (2003) SEARCHPKS: a program for detection and analysis of polyketide synthase domains. Nucleic Acids Res. 31, 3654-3658.

Zazopoulos, E., Huang K., Staffa, A., Liu, W., Bachmann, B.O., Nonaka, K., Ahlert, J., Thorson, J.S., Shen, B. and Farnet, C.M. (2003) A genomics-guided approach for discovering and expressing cryptic metabolic pathways. Nat. Biotechnol. 21, 187-190.

Zotchev, S.B., Stepanchikova, A.V., Sergeyko, A.P., Sobolev, B.N., Filimonov, D.A. and Poroikov, V.V. (2006) Rational design of macrolides by virtual screening of combinatorial libraries generated through in silico manipulation of polyketide synthases. J. Med. Chem. 49, 2077-2087.

66 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

ACKNOWLEDGEMENTS

All scientific work included in this thesis was done in the Department of Genetics, Technische Universität Kaiserslautern and partly in the Section for bionformatics, Faculty of Food Technology and Biotechnology, University of Zagreb under the guidance of Prof. Dr. John Cullum and Prof. Dr. Daslav Hranueli as a part of bilateral cooperation between Germany and Croatia on a scientific project: "In silico studies of recombination in modular biosynthetic clusters". Software mentioned within is being filed as a patent application

I wish to thank Prof. Dr. Daslav Hranueli and Prof. Dr. John Cullum for supervising me and Prof. Dr. Regine Hakenbeck and Prof. Dr. Matthias Hahn for agreeing to serve in the examination committee.

I’m also grateful for my friends Jurica Zucko, Janko Diminic , Mouhsine Elbekali and Sameh H. Soror with whom I had pleasure to work and whose company I enjoyed.

Thanks to Nina, Erna and Ingeborg for helping me not getting “Lost in Translation”.

And finally thanks to Lucija and my parents for all the love and patience which make the toughest things seem solvable.

67 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

APPENDICES

All the appendices stated below are supplied on the CD rom.

1. The entire PhD thesis in PDF format.

2. The supplementary material from the paper: "Starcevic, A., J. Cullum, M. Jaspars, D. Hranueli & P.F. Long. Predicting the nature and timing of epimerisation on a modular polyketide synthase. ChemBioChem, 8, 28–31, 2007".

3. The supplementary material from the paper: "Starcevic, A., J. Zucko, J. Simunkovic, P.F. Long, J. Cullum & D. Hranueli. ClustScan: An integrated program package for the semi- automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res., 36, 6882-6892, 2008.".

4. The supplementary material from the paper: "Starcevic, A., S. Akthar, W.C. Dunlap, J.M. Shick, D. Hranueli, J. Cullum & P.F. Long. Enzymes of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins. Proc. Natl. Acad. Sci. USA, 105, 2533-2537, 2008.".

68 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

LANIŠTE 5D, 10000 ZAGREB PHONE 098/491-393, 051/255-487 • E-MAIL [email protected] ANTONIO STARČ EVIĆ

PERSONAL DATA

Date of Birth: 08. 04. 1979. Place of Birth: Rijeka, Croatia Nationality: Croatian

EDUCATION

1994 – 1998 Andrija Mohorovičić Highschool Rijeka

1998 – 1999 Faculty of Pharmacy and Biochemistry Zagreb

1999 – 2004 Faculty of Food Technology and Biotechnology Zagreb

Diplom thesis:

Construction of relational database for modular polyketide synthases and nonribosomal peptide synthetases

PROFESSIONAL EXPERIENCE

April, 2005 – February, 2006: Junior Research Assistant at the Section for Bioinformatics, Faculty of Food Technology and Biotechnology, University of Zagreb.

March, 2006 – February, 2008: Scientific assistant in the Department of Genetics, University of Kaiserslautern

March, 2008 – up till now: Junior Research Assistant at the Section for Bioinformatics, Faculty of Food Technology and Biotechnology, University of Zagreb.

69 Starcevic A. In silico studies of modular biosynthetic clusters. Doctoral dissertation. Kaiserslautern, University of Kaiserslautern, Dept. of Genetics, January 2009

Hiermit versichere ich, die vorliegende Dissertation in der Abteilung Genetik der Universität Kaiserslautern selbständig durchgeführt und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet zu haben.

Kaiserslautern, Januar 2009

Antonio Starčević

70