GigaScience, 8, 2019, 1–13

doi: 10.1093/gigascience/giz081 Downloaded from https://academic.oup.com/gigascience/article-abstract/8/7/giz081/5530325 by Max Planck Institut Fuer Evolutionaere Anthropologie user on 12 September 2019 Research

RESEARCH

Toxins from scratch? Diverse, multimodal gene origins in the predatory robber fly Dasypogon diadema indicate a dynamic venom evolution in dipteran insects

Stephan Holger Drukewitz 1,2,*, Lukas Bokelmann 3, Eivind A. B. Undheim 4,5 and Björn M. von Reumont 2,6,7,*

1Institute for Biology, University of Leipzig, Talstrasse 33, 04103 Leipzig, Germany; 2Project group Bioresources, Animal Venomics, Fraunhofer Institute for Molecular Biology and Applied Ecology, Winchesterstrasse 2, 35392 Gießen, Germany; 3Evolutionary Genetics Department, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, D-04103 Leipzig, Germany; 4Centre for Advanced Imaging, The University of Queensland, St. Lucia, QLD 4072, Australia; 5Centre for Ecology and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066 Blindern, 0316 Oslo, Norway; 6LOEWE Centre for Translational Biodiversity Genomics (LOEWE-TBG), Senckenberganlage 25, 60325 Frankfurt, Germany and 7Institute for Insect Biotechnology, Justus Liebig University, Heinrich Buff Ring 58, 35394 Gießen, Germany

∗Correspondence address. Stephan Holger Drukewitz, Project group Bioresources, Animal Venomics, Fraunhofer Institute for Molecular Biology and Applied Ecology, Winchesterstrasse 2, 35392 Gießen, Germany, E-mail: [email protected] http://orcid.org/0000-0003-2482-9342; Björn M. von Reumont, Institute for Insect Biotechnology, Justus Liebig University, Heinrich Buff Ring 58, 35394 Gießen, Germany, E-mail: [email protected] http://orcid.org/0000-0002-7462-8226

Abstract

Background: Venoms and the toxins they contain represent molecular adaptations that have evolved on numerous occasions throughout the animal kingdom. However, the processes that shape venom protein evolution are poorly understood because of the scarcity of whole-genome data available for comparative analyses of venomous species.

Results: We performed a broad comparative toxicogenomic analysis to gain insight into the genomic mechanisms of venom evolution in robber flies (Asilidae). We first sequenced a high-quality draft genome of the hymenopteran-hunting robber fly Dasypogon diadema, analysed its venom by a combined proteotranscriptomic approach, and compared our results with recently described robber fly venoms to assess the general composition and major components of asilid venom. We then applied a comparative genomics approach, based on 1 additional asilid genome, 10 high-quality dipteran genomes, and 2 lepidopteran outgroup genomes, to reveal the evolutionary mechanisms and origins of identified venom proteins in robber flies.

Conclusions: While homologues were identified for 15 of 30 predominant venom proteins in the non-asilid genomes, the remaining 15 highly expressed venom proteins appear to be unique to robber flies. Our results reveal that the venom of D. diadema likely evolves in a multimodal fashion comprising (i) neofunctionalization after gene duplication, (ii) expression-dependent co-option of proteins, and (iii) asilid lineage-specific orphan genes with enigmatic origin. The role of such orphan genes is currently being disputed in evolutionary genomics but has not been discussed in the context of toxin evolution. Our results reveal an unexpectedly dynamic venom evolution in asilid insects, which contrasts with the findings of the

Received: 7 March 2019; Revised: 7 May 2019; Accepted: 14 June 2019

© The Author(s) 2019. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

only other insect toxicogenomic evolutionary analysis, in parasitoid wasps (Hymenoptera), where toxin evolution is dominated by single gene co-option. These findings underpin the significance of further genomic studies to cover more neglected lineages of venomous taxa and to understand the importance of orphan genes as possible drivers for venom evolution.

Keywords: toxin gene evolution; orphan genes; venom evolution; single gene co-option; gene duplication; comparative venom-genomics

Introduction

The predominant scenario for the evolution of a new gene function presumes that gene duplication is followed by neo- or sub-functionalization of 1 of the copies, resulting in a novel gene function [1, 2]. To differentiate mechanisms of gene origin, a larger taxon sampling and high quality of utilized whole-genome data are mandatory. This objective is now more achievable because of the fast development in next-generation sequencing technology. However, whole-genome data for comparative analyses are still sparse in evolutionary venomics (Supp. Table 1) and, as a consequence, the relative importance of the underlying mechanisms in the evolution of venom proteins and peptides remains to be addressed in more detail.

Venoms have evolved across a wide range of animal lineages as important evolutionary traits that are used for predation, defense, or competition [3–6]. They are cocktails of bioactive molecules that are usually composed mainly of peptides and proteins, collectively referred to as “toxins,” that often exhibit a variety of pharmacological properties linked to their toxicity. These venom proteins and peptides have evolved new toxic functions from non-toxic ancestral versions, and they are thus ideal candidates to test classical hypotheses on the evolution of new gene functions.

However, only a few comparative studies based on whole-genome data have explored the different mechanisms that instigate the origin of toxin genes. In general, toxin evolution by gene duplication represents a widely accepted hypothesis and receives support as a major mechanism of toxin origin from genomic analyses of the king cobra (Ophiophagus hannah), the Chinese scorpion (Mesobuthus martensii), and the Brazilian white-knee tarantula (Acanthoscurria geniculata) [7–9]. In contrast, analyses of the genomes of the platypus (Ornithorhynchus anatinus) and parasitic wasps (Nasonia vitripennis, Trichomalopsis sarcophagae) found that in these lineages, co-option of single-copy genes reflects the dominating process that shapes toxin evolution [10, 11]. Nevertheless, the available genomes of venomous taxa often reflect improper sampling densities of the respective lineages (Supp. Table 1). As a consequence, there is a need for comparative approaches, which add more genome data to clades of interest and suitable outgroups, to provide a better understanding of general processes in toxin evolution.

In this study, we examine the processes that drive toxin evolution in robber flies (Asilidae, Diptera), which is one of the largest extant fly groups and includes >7,000 species [6, 12]. Asilids are also the only known clade within dipteran insects in which both sexes use venom for an adult predatory lifestyle [6, 12]. We first characterized the venom system of male and female specimens of Dasypogon diadema using a combination of functional morphology, venom gland transcriptomics, and venom proteomics. D. diadema is of particular interest because it specializes in hunting hymenopterans, which possess venom that can be used in defense and thus represent potentially dangerous prey [13, 14]. We also utilized transcriptome and proteome data from the venom of 2 additional European asilids (Eutolmus rufibarbis and Machimus arthriticus) to determine major venom components in robber flies [12], and compared our results with a third, recently published study of the Australian giant robber fly (Dolopus genitalis) [15].

The mechanisms by which the identified venom proteins evolved in D. diadema were subsequently inferred by performing an extensive comparative genomics analysis. To reveal the evolutionary origin of asilid venom proteins, we sequenced, assembled, and annotated a high-quality draft genome of D. diadema, and co-annotated a recently published genome of the asilid Proctacanthus coquillettii [16]. We then compared these to publicly available high-quality genomes of 10 dipteran and 2 lepidopteran model organisms. Our results reveal a complex, multimodal pattern for the origin of venom proteins, and that the venom of D. diadema evolved dynamically through mechanisms that include both gene duplication and single gene co-option. The venom proteins partly originate from genes with ancestral variants already present in the protein-coding genome of the last common ancestor (LCA) of Diptera and Lepidoptera. Other putative toxins are lineage-specific to robber flies and show no detectable homologues outside the asilid genomes. Our results are based on the largest comparative genomics data set in evolutionary venomics to date and demonstrate the potential and necessity of comparative genomics to understand venom evolution in a broader context.

Results

The venom system of Dasypogon diadema

To compare the venom delivery system of D. diadema with those of previously described asilid species, we examined the morphology of its venom apparatus by performing synchrotron-based microcomputer tomography reconstructions of both a male and a female specimen. We found no differences between the compared male and female specimens of D. diadema; however, to discount sexual dimorphism in asilid venom systems, this result should be combined with a larger sampling size per sex for definite conclusions. The venom apparatus of D. diadema appears generally similar to the previously described structures of E. rufibarbis [12], with the exception that the venom apparatus of D. diadema features more complex and elongated, sub-structured thoracic venom glands (Fig. 1).

Complementing our morphological analysis, the venom composition of D. diadema was investigated by applying a combination of venom gland, proboscis, and body tissue transcriptomics and a proteomic analysis of venom gland extracts from both sexes. Apart from a more complex morphology, the venom cocktail of D. diadema showed a number of differences compared with the described venom of E. rufibarbis and M. arthriticus [12]. The most striking disparity is that the venom of D. diadema contained chitinase-like proteins and proteins that belong to the catabolite gene activator protein (CAP) superfamily, which were absent in the venoms of E. rufibarbis and M. arthriticus (Fig. 2). The expression level of transcripts coding for chitinase-like proteins

were ranked third (female) and fourth (male) among all identified venom proteins (male: TPM, 4.16%; female: TPM, 3.85%, percentage of the summed TPM value of all identified venom proteins), while CAP-like proteins were expressed at a comparably low level in both sexes (male: TPM, 1.34%; female: TPM, 1.23%) (Fig. 2). We also identified 5 families of novel venom proteins among the 30 predominant putative toxins, which we named asilidin11–15, according to existing robber fly toxin nomenclature [12, 17] (Figs 2, 4, Supp. Table 2 and Supplementary File 4). Last, we identified peptidase S1 in the venom of D. diadema, which is also abundant in the venoms of E. rufibarbis and M. arthriticus.

Fig. 1: The 3D reconstructed venom delivery system of female and male Dasypogon diadema. The general anatomy of D. diadema is similar between both sexes and to the structures described for Eutolmus rufibarbis. A pair of elongated sac-like glands located in the first and second thoracic segments (right and left glands coloured red and orange, respectively) open separately into ducts (coloured green), which fuse just before entering the head capsule and continue to the tip of the proboscis. Compared with the glands of E. rufibarbis, the glands of D. diadema are more elongated, featuring a larger volume and sub-compartmentalization. The labial glands (coloured blue) are located in the middle part of the proboscis and open into the lumen between theca and the labium at the tip of the proboscis.

While we observed differences between species, there were also a number of families with similar expression levels across the examined species, which we define as major venom components of asilids. One such component is the previously described family asilidin1 (E. rufibarbis: 2.4%; M. arthriticus: 2.13%; female D. diadema: 1.91%; male D. diadema: 2.18%) [12]: its putative inhibitor cystine knot peptides (ICKs) were shown to have neurotoxic effects on the European honeybee (Apis mellifera) [12]. As for E. rufibarbis and M. arthriticus, we also identified members of the asilidin5 family and MBF2-domain–like proteins in the venom of D. diadema. However, the 2 most dominantly expressed venom gland protein families for all species are asilidin2 and asilidin3, which account for 75% (M. arthriticus), 75% (male D. diadema), 83% (female D. diadema), and 86% (E. rufibarbis) of the toxin-assigned TPM values (Fig. 2).

Genome data quality and completeness

To assess the evolutionary origin of the venom proteins of D. diadema, we combined the protein-coding genome of high-quality genomes from Diptera and Lepidoptera with our venomic data from the female and male specimen of D. diadema (Table 1; Supp. Table 2) [18, 19]. We also used our venomic data to re-annotate the first high-quality robber fly genome, of P. coquillettii [16], and to annotate the D. diadema genome sequenced and assembled in the present study (Table 1; Supp. Table 3; accession numbers of SRA and BioSample entries for transcriptome and genome data are linked to the BioProject PRJNA361480, see also Data availability section). Both robber fly genome annotations were refined by including all transcriptomic and proteomic data of asilid venom glands during the annotation.

Gene sets of dipterans and lepidopterans obtained from ENSEMBL scored a 68.9–99.7% completeness when analysed with BUSCO (Table 1) [20, 21]. The presented sets of protein-coding genes of the robber flies P. coquillettii and D. diadema match this range, scoring 96.7% and 91.1% completeness, revealing high-quality annotations and assembly completeness (Table 1).

Assessing ancestral gene variants

The protein-coding genomes of D. diadema and P. coquillettii, 10 non–robber fly dipterans, and 2 lepidopterans were compared and sorted using the Orthofinder pipeline (Table 1) [18]. Orthofinder performs a BlastP similarity search followed by normalization for sequence length, creation of an orthogroups graph, and Markov cluster algorithm clustering to sort the genes according to their likeliest homology relationships. The recovered orthogroups comprise protein-coding genes that originated from a single gene in the LCA of all analysed species or lineage-specific genes in a certain clade. An orthogroup can comprise several or only parts of a single gene family, which might change with the analysed taxa and the depth of the considered evolutionary splits. Genes without homologues in any of the included genomes cannot be assigned to orthogroups.

The final annotation of the D. diadema genome consists of 15,480 protein-coding genes, of which 13,981 genes were sorted into 8,878 orthogroups. The remaining 1,499 protein-coding genes did not match any of the assigned orthogroups (Fig. 3a, Supp. File 2). In our analysis D. diadema served as the focal organism; the origin of the protein-coding genes was inferred from their first-time emergence. For instance, genes of D. diadema with homologues in the lepidopterans Bombyx mori or Danaus plexippus or both were assigned to originate in the LCA of Diptera and Lepidoptera, or earlier. Following this concept, orthogroups were sorted to the considered phylogenetic splits (Fig. 3a).

The split between the Diptera and Lepidoptera lineages is the oldest one considered in our analyses. These 2 clades share 84% (7,471) of the orthogroups assigned to D. diadema (Fig. 3) [22], meaning the ancestral versions of these protein-coding genes already existed in the LCA of the dipteran and lepidopteran clade. Of the remaining orthogroups, 877 are unique for the clade of Diptera, 158 are unique for the split between the gall midge Mayetiola destructor and the brachyceran clade, 246 are unique for Brachycera, and 110 orthogroups are shared only between the 2 robber flies (Fig. 3a). Sixteen orthogroups are constituted of protein-coding genes found exclusively in D. diadema (Fig. 3a).

The venom gland proteins identified via proteomics were sorted to their associated orthogroups. We then tested whether the non-toxic ancestral version of a putative toxin was already present in the protein-coding genome of the LCA of the compared species, or whether the protein is a unique novelty for a certain clade. A total of 109 orthogroups, which were already present in the LCA of Lepidoptera and Diptera, are associated with ≥1 venom protein of the female and male D. diadema. Three orthogroups with venom proteins were unique to each of Diptera

[Fig. 2 panel labels: Eutolmus rufibarbis; Machimus arthriticus; Dasypogon diadema (female); Dasypogon diadema (male). Labelled protein families: asilidin1, asilidin2, asilidin3, asilidin4, asilidin5, asilidin9.]

Fig. 2: Relative expression of putative toxin families in Dasypogon diadema (male and female), compared to Eutolmus rufibarbis and Machimus arthriticus. The expression levels of protein families secreted in the venom glands are given in percent. Only sequences with matches from proteomics and a threshold >1 transcripts per million (TPM) are included. Protein classes with an expression value <1% of the depicted TPM are summarized in the category “others.”
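The percentages in Fig. 2 are shares of the summed TPM of the retained venom proteins. The following minimal Python sketch illustrates that calculation under stated assumptions: the function name, the record layout, and all example TPM values are hypothetical and are not taken from the study; only the filtering rules (proteome match, >1 TPM, pooling of classes below 1% into "others") follow the caption.

```python
from collections import defaultdict


def family_shares(records, min_tpm=1.0, others_below=1.0):
    """Return {family: percent of summed TPM}; small classes are pooled as 'others'.

    records: iterable of (family, tpm, in_proteome) tuples (hypothetical layout).
    """
    totals = defaultdict(float)
    for family, tpm, in_proteome in records:
        # Keep only transcripts with a proteome match and expression above the cut-off.
        if in_proteome and tpm > min_tpm:
            totals[family] += tpm

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    shares = {}
    others = 0.0
    for family, tpm in totals.items():
        percent = tpm * 100.0 / grand_total
        if percent < others_below:
            others += percent  # pooled into the "others" category
        else:
            shares[family] = percent
    if others > 0:
        shares["others"] = others
    return shares


# Invented example values for illustration only (not data from the study).
example = [
    ("asilidin2", 600.0, True),
    ("asilidin3", 300.0, True),
    ("asilidin1", 95.0, True),
    ("peptidase S1", 5.0, True),    # below 1% of the total, so pooled into "others"
    ("unconfirmed", 900.0, False),  # no proteome match, so excluded
]
shares = family_shares(example)
```

Because the shares are computed relative to the retained transcripts only, they always sum to 100% within one sample, which is why values from different species or sexes are comparable as proportions but not as absolute expression levels.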

Table 1: Overview of all analysed genomes and their gene completeness

Order        Species                      No. of analysed CDSs   BUSCO completeness (%)   Source
Lepidoptera  Bombyx mori                  14,623                 84.5                     ∗∗∗
Lepidoptera  Danaus plexippus             15,128                 94.8                     ∗∗∗
Diptera      Culex quinquefasciatus       19,032                 89.9                     ∗∗∗
Diptera      Aedes aegypti                17,158                 95.5                     ∗∗∗
Diptera      Anopheles gambiae            14,916                 98.6                     ∗∗∗
Diptera      Anopheles darlingi           10,519                 90.1                     ∗∗∗
Diptera      Mayetiola destructor         22,410                 86.7                     ∗∗∗
Diptera      Dasypogon diadema            15,480                 91.1                     ∗
Diptera      Proctacanthus coquillettii   10,942                 96.7                     ∗∗
Diptera      Drosophila grimshawi         19,429                 99.4                     ∗∗∗
Diptera      Drosophila melanogaster      30,429                 99.7                     ∗∗∗
Diptera      Drosophila simulans          24,119                 99.2                     ∗∗∗
Diptera      Teleopsis dalmanni           16,570                 68.9                     ∗∗∗
Diptera      Lucilia cuprina              14,452                 91.7                     ∗∗∗

To infer the quality of the annotation, a BUSCO analysis was performed using the transcriptome mode and the holometabolous dataset. ∗Genome was sequenced and annotated for this study. ∗∗Genome from Dikow et al. [16] was reannotated. ∗∗∗Protein dataset from ENSEMBL. The order of the species in this table matches the species order in the cladogram in Fig. 3a.
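The first-emergence rule used to sort orthogroups onto the splits of Fig. 3a (an orthogroup containing a D. diadema gene is dated to the oldest split at which a homologue appears on the outgroup side) can be sketched in Python as follows. This is an illustration under assumptions, not the authors' pipeline: the grouping of species into successive outgroups follows Table 1, while the node numbering, labels, and function name are hypothetical. The final check uses the orthogroup counts reported in the text.

```python
FOCAL = "Dasypogon diadema"

# Successive outgroups of the focal species, ordered from the oldest split to
# the youngest. Node numbers and labels are assumptions made for this sketch.
SPLITS = [
    ("Node 1: Diptera + Lepidoptera", {"Bombyx mori", "Danaus plexippus"}),
    ("Node 2: Diptera", {"Culex quinquefasciatus", "Aedes aegypti",
                         "Anopheles gambiae", "Anopheles darlingi"}),
    ("Node 3: Mayetiola + Brachycera", {"Mayetiola destructor"}),
    ("Node 4: Brachycera", {"Drosophila grimshawi", "Drosophila melanogaster",
                            "Drosophila simulans", "Teleopsis dalmanni",
                            "Lucilia cuprina"}),
    ("Node 5: robber flies", {"Proctacanthus coquillettii"}),
]


def assign_node(orthogroup_species):
    """Date an orthogroup to the oldest split that has a member on the outgroup side."""
    species = set(orthogroup_species)
    if FOCAL not in species:
        raise ValueError("orthogroup does not contain the focal species")
    for label, outgroup in SPLITS:
        if species & outgroup:  # at least one homologue in this outgroup
            return label
    return "Node 6: D. diadema only"


# Orthogroup counts per split as reported in the text; they should add up to
# the 8,878 orthogroups that contain D. diadema genes.
REPORTED_COUNTS = {
    "Diptera + Lepidoptera": 7471,
    "Diptera": 877,
    "Mayetiola + Brachycera": 158,
    "Brachycera": 246,
    "robber flies": 110,
    "D. diadema only": 16,
}
assert sum(REPORTED_COUNTS.values()) == 8878
```

Checking outgroups from oldest to youngest means that a single lepidopteran homologue is enough to date an orthogroup to the deepest split, regardless of how many intermediate lineages lack a detectable copy.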

Fig. 3: (a) Phylogenetic relationships of the included taxa. Dasypogon diadema was used as the focal species for the analyses of the orthogroups. Boxes on the split show the number of orthogroups shared by D. diadema and the respective clade of the split (upper number: number of shared orthogroups; middle number: number of orthogroups with putative toxins; lower number: number of orthogroups associated with the 30 predominant putative toxins). (b) Heat map showing the expression level (TPM) in the 3 tissues of the putative toxins of both sexes. The white numbers in the black circle refer to the affiliated orthogroups and splits in 3a (Vg-♂: venom gland male; Vg-♀: venom gland female; Pb-♂: proboscis male; Pb-♀: proboscis female; Bt-♂: body tissue male; Bt-♀: body tissue female). (c) Summarized expression level (TPM) of the putative toxin transcripts in the venom gland of both sexes. The white numbers in the black circle refer to the affiliated orthogroups and splits in 3a (number of putative toxins for all nodes: Node 1: 130; Node 2: 3; Node 3: 0; Node 4: 5; Node 5: 18; Node 6: 1; ∗no orthogroup: 4).

and Brachycera, while 8 orthogroups with putative toxins were shared only between the 2 robber fly genomes (Fig. 3a). The majority of proteins identified in the venom gland can be assigned to protein-coding genes present in the orthogroups shared between the Lepidoptera and the Diptera clade. The transcripts of venom proteins assigned to orthogroups, which arise on Node 2, Node 3, or Node 4, are expressed on a low level in the venom glands of both sexes. Putative toxin transcripts of Node 1, Node 5, and the ones assigned to no orthogroup are expressed on a high level in the venom glands of both sexes (Fig. 3b and c, Supp. Figs 3 and 4).

Evolutionary pattern of the predominant venom proteins

To prevent an over-interpretation of the data, the process of venom evolution in D. diadema based on whole-genome data was analysed by using a stricter threshold and focusing exclusively on the dominant putative toxin transcripts. For this purpose, we included only putative toxin transcripts that were detected via proteomics, display an expression level in the venom gland of ≥500 TPM, and show a 4-fold higher expression level in the venom gland compared with the respective body tissue. Two independent tools (Segemehl and Salmon) were applied to

perform the RNA quantification and to test the robustness of the results [23, 24]. Both quantification approaches using identical thresholds reveal similar results. All 28 putative toxin transcripts identified via Segemehl were also identified with Salmon. Salmon, however, reported 2 further transcripts that still met the threshold. Further downstream analyses were based on the results from the quantification with Salmon, which resulted in a top 30 set of predominant putative toxins that are discussed further (Fig. 3b and c, Supp. Figs 3 and 4).

For 3 of those top 30 predominant putative toxins (U-Asilidin3-Dd1a, U-Asilidin3-Dd1b, and U-Asilidin1-Dd1a) no orthogroup was assigned, suggesting that these genes are unique to D. diadema (Figs 3 and 4, Supp. File 3, Supp. Table 6). The remaining 27 putative toxin transcripts were distributed among 20 different orthogroups (Supp. Table 6, Supp. File 3). While 11 of these orthogroups are shared between the lepidopteran and dipteran clade, 2 orthogroups are unique to the dipteran clade, 1 to the brachycerans, and 6 are shared only between the asilids. In general, 22 putative toxins can be categorized as multi-copy genes (Fig. 4). They are distributed between 15 different orthogroups, each composed of ≥2 protein-coding genes of D. diadema. Five of these groups contain 2 or more of the 30 predominant putative toxins. In 2 orthogroups (OG009368, OG0011154), all members are putative toxins and are present in the venom gland (Supp. Table 6). For 10 orthogroups, only 1 member is a putative toxin present in the venom gland while the others are not. The newly identified putative toxins U-Asilidin12-Dd1a, U-Asilidin13-Dd1a, and U-Asilidin14-Dd1a are all single-copy genes, while U-Asilidin11-Dd1a and U-Asilidin15-Dd1a are categorized as multi-copy genes (Supp. Table 6).

Members of the asilidin2 protein family are distributed across 4 different orthogroups; 3 of these are shared only between D. diadema and P. coquillettii, while the remaining 1 is shared between the Lepidoptera and Diptera (Fig. 4). A similar picture is revealed in larger protein families like PS1 and chitinase-like, for which distinct versions of putative toxins from different orthogroups were identified (Fig. 4, Supp. Table 6).

Fig. 4: The evolutionary pattern and the origin of the top 30 putative toxins. The node numbering refers to the nodes in Fig. 3a. Putative toxins present in Dasypogon diadema but missing in Eutolmus rufibarbis or Machimus arthriticus are coloured red. Single-copy genes: putative toxins with only 1 copy on the protein-coding genome of D. diadema; multi-copy genes∗: protein-coding genes that belong to orthogroups assembled of ≥2 protein-coding genes in D. diadema, with only 1 member of the orthogroup present in the venom; multi-copy genes∗∗: protein-coding genes that belong to orthogroups assembled of ≥2 protein-coding genes in D. diadema, with 2 or more members of the same orthogroup present in the venom.

Transposable elements

Transposable elements were identified in 11 of the 30 predominant toxins of D. diadema, including the protein families asilidin2, peptidase S1, chitinase, MBF2-domain, asilidin6, asilidin9, asilidin11, asilidin12, asilidin13, and asilidin15 (Supp. Table 7). In the dominant component asilidin2, the variants U-Asilidin2-Dd1a and U-Asilidin2-Dd2a harbour transposable elements in the intron sequence. In contrast, no gene variants classified as asilidin3, the second most highly expressed venom component, feature transposable elements. The majority of the transposable elements resemble retrotransposons classified as long terminal repeat retrotransposons of currently unknown groups. Other identified elements are retrotransposons classified as long interspersed nuclear elements and DNA-transposons classified as Mariner-like elements (Supp. Table 7).

Discussion

General aspects on the venom biology and composition

D. diadema is a widely distributed robber fly that is known to hunt honeybees (A. mellifera) and other hymenopterans [13, 14]. To overpower such dangerous prey, venom with neurotoxic components for rapid paralysis is advantageous. Trophic specialization has also been shown to affect venom composition and even venom apparatus morphology in other predatory venomous lineages, such as snakes [25, 26] and spiders [27]. We therefore expected the venom composition of D. diadema to contain substantial differences compared to the previously studied, more generalist species E. rufibarbis and M. arthriticus. Indeed, their venoms differ in some aspects, such as the presence of chitinase and CAP proteins in D. diadema, which were not detected in the venoms of E. rufibarbis and M. arthriticus. Similar to D. diadema, the venom composition of the Australian robber fly Dolopus genitalis also appears to contain a larger fraction of enzymatic proteins than those of E. rufibarbis and M. arthriticus [15]. D. genitalis venom also contained all asilidin families and major venom components that we discuss here [15]. Last, asilidin2 is an especially highly expressed component in all asilids, including D. genitalis. The observed slight sex-specific variation of the venom composition in our pooled samples of male and female individuals might be explained by the known differing ecology of males and females. However, this hypothesis is speculative and requires further testing with additional replicates.

In general, the venom of D. diadema shares the major components with E. rufibarbis and M. arthriticus. Additionally, the most dominant protein families in the venoms of all 3 species are asilidin2 and asilidin3, and all species also express asilidin1 transcripts (Fig. 2). The phylogenetic distance between E. rufibarbis and M. arthriticus (members of the larger subfamily Asilinae) compared to D. diadema (representative of the subfamily Dasypogoninae) [16, 28] suggests that these 3 protein classes resemble a lineage-specific toxin arsenal of robber flies, a conclusion that is corroborated by the study of Walker and colleagues [15].

In the present study the de novo assembly of transcriptome data was performed using a single assembler, Trinity, which is one of the most established programs to assemble transcriptome data sets [29]. Nevertheless, de novo transcriptome assembly is challenging and different assembly software packages often construct differing sets of transcripts. It has been shown in snakes and scorpions that the number of assembled toxin transcripts may vary depending on the chosen assembler [30]. Thus, applying only 1 assembler as a base for our analyses may mean that some of our putative toxins may include false-positive results and that we might have missed some toxins that represent false-negative results.

To avoid false-positive results and an over-interpretation of our data, we used only transcripts that were recovered in the proteome and then identified in the whole genome as baseline to discuss possible toxins. We also used 2 additional transcriptome assemblers, rnaSPAdes [31] and Trans-ABySS [32], and assessed their ability to recover our top 30 predominant toxins identified using Trinity. Except for a few candidates, the majority of the top 30 candidate toxins were recovered with identical or highly similar sequences in the additional assemblies. Our conclusion is therefore that the pattern of venom protein evolution that we discuss here for the most highly expressed, and hence ecologically probably most important, putative toxins is rather robust. (All details are shown in Supplementary Tables 8 and 9, and all visualized alignments comparing the contigs from different assemblers are provided in the GigaScience data cloud).

Determining the frequency of false-negative results would require extensive additional work: specifically, using multiple other de novo assemblers on all the data to see whether anything had been missed in the Trinity assembly. In principle, because our toxin evolution findings were attained using analyses on only the top 30 identified toxins, the impact of false-negative results on our findings is likely to be limited. However, if any missed (false negative) toxins have to be added to our current top 30 toxins, our conclusions could be affected. Additional details on the processes of venom evolution in robber flies will also be revealed by further genome data and deeper, more detailed proteomic analyses of milked venom from single specimens.

The evolution of the neurotoxic component asilidin1

Asilidin1 peptides resemble an inhibitor cystine knot (ICK)-like fold, and 1 representative, U-asilidin1-Mar1a, was shown to induce neurotoxic effects on the European honeybee (A. mellifera) [12]. Facilitating a fast and efficient paralysis of prey, asilidin1 probably represents a biologically important component in robber fly venom. ICK peptides have been convergently recruited as neurotoxic venom components in a range of venomous lineages, including scorpions, spiders, assassin bugs, cone snails, and possibly also remipede crustaceans [33–43]. The identification of ancestral versions of short neurotoxins, such as ICK peptides, that feature a conserved cysteine scaffold with variable positions between the cysteines remains a challenge [38]. Indeed, while our complementary proteomic and transcriptomic analyses of the venom gland proteins of D. diadema revealed 3 different asilidin1 variants, only 1 protein-coding gene was detected at the genome level (U-Asilidin1-Dd1a). The U-Asilidin1-Dd1a gene is not a member of a gene family with several duplicates but represents a single-copy gene. Differences in the coding sequences derived from transcriptome data thus likely reflect allelic variation in specimens that had to be pooled for proteome and transcriptome analyses to achieve sufficient tissue quantities. This finding highlights the possible bias of predicting toxin diversity in data from pooled samples.

General patterns of venom protein evolution

The evolutionary origin of the major venom proteins in D. diadema can be classified into 2 major categories. The first category comprises variants of both single- and multi-copy genes with ancient origin. These robber fly toxins have homologous genes in the lepidopterans or non-asilid dipterans, and originate from ancestral protein versions, which occur in the LCA of asilids and the respective clade.

Four single-copy genes of the protein families asilidin12 (U-Asilidin12-Dd1a), asilidin13 (U-Asilidin13-Dd1a), asilidin14 (U-Asilidin14-Dd1a), and chitinase with homologues outside the asilid clade provide examples of venom protein evolution without gene duplication. These genes (13.3% of the predominant

venom proteins) most likely feature an expression-dependent single gene co-option–type functional recruitment. Under this scenario, an up-regulation of expression in the venom gland tissue and the injection of the otherwise physiological protein as a venom component might lead to a toxic effect in the prey species. In contrast, putative toxins of the protein families asilidin2, asilidin9, CAP, chitinase, peptidase S1, and MBF2-domain–like proteins are present as multi-copy genes. The revealed pattern of 1 or more duplication events in the history of these genes supports the widely proposed hypothesis of toxin evolution by gene duplication [3, 4, 44].

The second category of venom proteins includes putative toxins without homologues outside the asilid lineage. Multi-copy genes dominate this category (asilidin2, peptidase S1), although single-copy genes are also present (asilidin6). Asilidin2 in particular shows a pattern of intense gene duplication, and several transcripts in this family from different orthogroups are secreted in the venom glands. These single- and multi-copy genes are robber fly lineage-specific and their ancestry is enigmatic. Intriguingly, we identified transposable elements in 11 venom proteins, including 2 variants of the highly expressed asilidin2. Two-thirds of the venom proteins do not show any presence of transposable elements. We can only speculate here that the evolution of single toxins might be influenced by transposable elements and that this might be an explanation for the diversity of asilidin2 variants. However, to provide a profound analysis of the influence of transposable elements on the evolution of venom proteins, the analysis design needs to be adapted and whole-genome data and venom protein data of more species need to be included.

Conclusion

The insects include several venomous lineages and comprise the greatest number of venomous species within the animal kingdom [4]. For many of these, the venom compositions and putative toxins remain unknown [6]. Besides hymenopteran and heteropteran taxa, insects also harbor predatory and venomous asilid dipterans. Despite some differences between studied species, our results suggest that the major components of asilid venom constitute new putative toxins that are likely to be restricted to asilids. These include the asilidin1 family, which contains the recently described neurotoxic component U-asilidin1-Mar1a, and has been identified in all 4 studied asilid venoms, including D. diadema (U-Asilidin1-Dd1a) [12, 15].

The present study includes the currently most comprehensive species set of genomes to assess the evolution of venom proteins in D. diadema as a representative in the previously uncovered dipteran lineage of robber flies. Our analysis is further strengthened by the implementation of gene-sets from model organisms and closely related species, maximizing our ability to detect toxin homologues and identify the processes that underlie their evolution. This approach revealed that the processes that contribute to the evolution of toxins in D. diadema venom are multimodal and include (i) expression-dependent co-option of housekeeping genes, (ii) neofunctionalization after gene duplication events, and (iii) highly expressed lineage-specific orphan genes. Intriguingly, several of these lineage-specific genes of venom proteins remain of enigmatic origin. The role of these orphan genes as possible drivers in venom evolution represents an intriguing topic for future studies. Our findings highlight the value of studying neglected venomous lineages to improve our understanding of the evolution of venoms and their toxins, and hence the evolutionary mechanisms involved in the evolution of protein function.

Methods

Robber fly collection and sample preservation

Specimens were collected in June 2014 in France at the riverbanks of the river Têt north of Millas in the Département Pyrénées-Orientales (Occitanie) and the vineyards around Brûlat in the Département Var (Provence-Alpes-Côte d’Azur). For transcriptome sequencing, samples from body tissue, thoracic gland tissue, and proboscis tissue of 6 males and 6 females were separately dissected and preserved in RNAlater (Ambion, Thermo-Fisher, Waltham, MA, USA). All dissected individuals were preserved in 94% ethanol as voucher specimens. In addition, thoracic glands from 7 males and 5 females were crushed after dissection in 1× phosphate-buffered saline buffer with proteinase inhibitor tablets (Complete Ultra, ROCHE, Mannheim, Germany) for proteomic work. See also Supplementary Fig. 5 for the general workflow. Two individuals of each sex were deposited in Bouin liquid to perform synchrotron-based microcomputer tomography.

Venom apparatus

The functional morphology of the venom delivery system in both sexes of D. diadema was investigated using synchrotron-based microcomputer tomography. Bouin-preserved samples were critical point dried, mounted on a specimen holder, and scanned at the Swiss Light Source electron synchrotron accelerator. Morphological structures were segmented in aligned image stacks using ITK-snap v.3.60 [45]. The visualization of the reconstructed 3D model was carried out using Blender v.2.79 [46].

Transcriptomics

Total RNA of thoracic glands, proboscis tissue, and body tissue was extracted following the standard protocol for Trizol Reagent by Thermo Fisher, Waltham, MA, USA. For both sexes, the gland and proboscis tissues of 6 specimens were pooled to guarantee sufficient RNA quantity, while body tissue was extracted from 1 individual per sex. All 6 samples for male and female D. diadema specimens were prepared for sequencing at the Core Unit DNA Technologies of the University of Leipzig using the Illumina poly-A selection protocol. Sequencing was performed on the Illumina HiScanSQ platform with 100 bp paired-end reads (Supp. Table 5). All generated data are accessible via the BioProject PRJNA361480, including all BioSample and SRA entries (see also Supp. Table 4). In addition to our own data, all available asilid transcriptomes were mined in the SRA archive for later genome annotation (Supp. Table 4). All transcriptome raw reads were processed in the same way after visual inspection in FastQC [47]. Quality filtering and trimming was then applied in Trimmomatic v.0.33 with a minimum length of 60 bp and a minimum phred score of 30 [48]. All pre-processed datasets were finally assembled using Trinity v.2.4 with default settings except a minimum contig length of 138 [29]. The transcript abundance in all D. diadema tissue samples was estimated by mapping the trimmed RNA reads with Segemehl (alignment accuracy, 98%) [24, 49] and by comparatively quantifying reads with Salmon (default settings). The TPM (transcripts per million) values for each coding domain sequence were visualized with a customized Python

script and the Seaborn package; see also Identification of venom proteins section.

Proteomics

The lyophilized venom from the thoracic glands preserved in proteinase inhibitor was dissolved in water and prepared for proteomic analysis as described in Drukewitz et al. [12]. Briefly, the samples were desalted by means of acetone precipitation, proteins reduced with dithiothreitol, alkylated with iodoacetamide, and digested by overnight incubation with trypsin. The digested venom was desalted using a C18 ZipTip (Thermo Fisher, Waltham, MA, USA), dried in a vacuum centrifuge, and dissolved in 0.5% formic acid before 2 μg of each sample was analysed by liquid chromatography with tandem mass spectrometry (MS/MS) on an AB Sciex 5600 TripleTOF (Framingham, MA, USA) equipped with a Turbo-V source heated to 550 °C and coupled to a Shimadzu Nexera UHPLC (Kyoto, Japan). The digested venom was fractionated with an Agilent Zorbax stable-bond C18 column (2.1 × 100 mm, 1.8 μm particle size, 300 Å pore size), across a gradient of 1–40% solvent B (90% acetonitrile, 0.1% formic acid) in 0.1% formic acid over 60 min, using a flow rate of 180 μL/min (all solvent concentrations are in volume to volume). MS1 survey scans were acquired at 300–1,800 m/z over 250 ms, and the 20 most intense ions with a charge of +2 to +5 and an intensity of ≥120 counts/s were selected for MS2. The unit mass precursor ion inclusion window was ±0.7 Da, and isotopes within ±2 Da were excluded from MS2. MS2 scans were acquired at 80–1,400 m/z over 100 ms and optimized for high resolution.

For protein identification, MS/MS spectra were searched against sequence lists consisting of both the translated venom gland and body transcriptomes of D. diadema using ProteinPilot v5.0 (AB Sciex, Framingham, MA, USA). Searches were run as thorough identification searches, specifying urea denaturation, tryptic digestion, and cysteine alkylation by iodoacetamide. Amino acid substitutions and biological modifications were allowed in order to identify potential post-translational modifications and to account for chemical modifications due to experimental artefacts. Decoy-based false-discovery rates (FDRs) were estimated by ProteinPilot, and for our protein identification we used a protein confidence cut-off corresponding to a local FDR of <0.5%. Spectra were also manually examined to further eliminate any false-positive results.

Genome sequencing and assembly

DNA was extracted from 30 mg of muscle tissue of a female specimen of D. diadema. The tissue was dissolved in 500 μL lysis buffer (10 mM Tris-HCl pH 8, 0.5% [w/v] sodium dodecyl sulfate, 2.4 mg/mL proteinase K, 1 mM ethylenediaminetetraacetic acid [EDTA] pH 8) for 50 min at 50 °C while shaking. Chitinous debris was spun down in a table centrifuge, and the DNA was extracted from the supernatant using MinElute silica spin columns (MinElute PCR Purification Kit, Qiagen, Hilden, Germany) according to the manufacturer’s specifications. Two aliquots of 3 μg isolated DNA were sheared to 200 and 400 bp average length in a Covaris S220 Focused Ultrasonicator (200 bp settings: 10 dc, 5 i, 200 cpb, fs 180 s; 400 bp settings: 10 dc, 4 i, fs 55 s). 100 ng sonicated DNA served as input for library preparation as described by Meyer and Kircher [50]. Both libraries were double-indexed with 2 unique barcodes of 7 bp and amplified as described by Kircher et al. [51]. Paired-end reads were subsequently sequenced with 150 bp on an Illumina MiSeq platform. All raw reads were visually inspected in FastQC [47] and then quality filtered and trimmed applying Trimmomatic v.0.33 with a minimum length of 70 bp and a minimum phred score of 30 [48]. An overview of sequenced raw reads and processed transcripts is given in Table 2.

The genome assembly was performed with Maryland Super Read Cabog Assembler (MaSuRCA) v.3.1.3 with the linking mates option set to 1 and the cgwErrorRate set to 0.15; all other options were default [52]. To inspect the quality and to exclude possible contamination, Blobtools was applied [53]. The final assembly resulted in an overall assembly size of 450 Mb (scaffold > 2 kb), with an N50 of 32.6 kb and a guanine-cytosine content of 35.81%. Assembly size, N50 value, and other statistics were assessed with Quast v.4.6 [54]. The final genome size is in line with the prior estimated size via k-mer distribution using jellyfish [55], which resulted in 427 Mb (Supp. Fig. 2). The assessment with BUSCO (genome mode, holometabolous core gene set) resulted in 92.4% completeness and a duplication rate of 2.7%, which indicates a high quality of the draft genome of D. diadema and that the heterozygous areas were adequately assembled [20].

Genome annotation

Our genome sequence of D. diadema was co-annotated with the recently published genome of P. coquillettii using the Maker2 pipeline [16, 56]. All de novo assembled transcriptome data sets were then utilized to identify splice sites using Exonerate [57] (Supp. Table 3). Additionally, the protein sequences of Aedes aegypti, Anopheles gambiae, Mayetiola destructor, Lucilia cuprina, and Drosophila melanogaster from the ENSEMBL genome database and all insect proteins from the Swissprot database were aligned using BLAST+ v.2.6.0. Successfully aligned positions were extracted to train the gene prediction software Augustus and SNAP [21, 58–60]. The resulting Maker2 gene set after 4 iterative training cycles was finally used for further downstream analyses. The annotation resulted in 10,942 protein-coding genes in the genome of P. coquillettii and 15,480 protein-coding genes in the genome of D. diadema. The completeness of both gene sets was inferred with BUSCO [20] (transcriptome mode, holometabolous core gene set) and resulted in a completeness of 91.1% for D. diadema and 96.7% for P. coquillettii (Table 1).

Identification of transposable elements

Repetitive elements in the genome of D. diadema and P. coquillettii were identified using RepeatModeler (v. open-1.0.11); the resulting repeat library was provided to RepeatMasker (v. open-4.07) [61, 62] to mask repetitive elements prior to the annotation of genes. For D. diadema the RepeatMasker output was parsed with the “One code to find them all” perl tool [63] using the “strict” option. The resulting overview tables were used to analyse the appearance of transposable elements in the top 30 dominant toxins (Supp. Table 7).

Identification of venom proteins

Putative toxins and venom protein families were identified by applying the approach described in Drukewitz et al. [12]. The strategies for transcriptomics were to perform BlastP searches against ToxProt, to run hidden Markov model searches using HMMER v.3.1b2 [64] against our own venom protein databases, and to characterize highly expressed coding regions. The major difference in the present analysis is that coding domain regions used to identify putative toxins are not derived from de novo transcripts but instead based on genome loci that were an-

Table 2: Overview of DNA libraries generated for the Dasypogon diadema genome assembly

Library name   Fragment length (nt)   No. of sequenced read pairs   Theoretical genome coverage (fold)
D1130          200                    9,119,970                     5
D1131          400                    167,137,385                   89
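The theoretical coverage values in Table 2 can be reproduced with a short calculation. The sketch below is illustrative only: it uses the genome size estimate of 450 Mb and the processed read length of 120 nt stated for Table 2, and it assumes (this is not spelled out in the text) that both reads of each pair contribute 120 nt. The function and variable names are hypothetical.

```python
def theoretical_coverage(read_pairs, read_length_nt, genome_size_bp):
    """Fold coverage = total sequenced bases / genome size.

    Assumption: every read pair contributes two reads of read_length_nt each.
    """
    return read_pairs * 2 * read_length_nt / genome_size_bp


GENOME_SIZE_BP = 450_000_000  # genome size estimate used for Table 2
READ_LENGTH_NT = 120          # read length after processing (Table 2 note)

# Read-pair counts per library, taken from Table 2
libraries = {"D1130": 9_119_970, "D1131": 167_137_385}

coverage = {
    name: theoretical_coverage(pairs, READ_LENGTH_NT, GENOME_SIZE_BP)
    for name, pairs in libraries.items()
}

for name, fold in coverage.items():
    print(f"{name}: {fold:.1f}-fold")
```

Under these assumptions the two libraries come out at roughly 5-fold and 89-fold, matching the rounded values in the table.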

Number of read pairs and fragment size of the libraries used for the genome assembly are shown. The theoretical genome coverage was calculated with a genome size estimate of 450 Mb and a read length of 120 nt (nucleotides) after processing.

notated by transcriptome and proteome sequences. The annotated protein-coding genes of D. diadema were matched with the venom gland proteins identified via proteomics applying a strict threshold (e-value of 1e−40, query coverage of 90%). This cut-off was used to reduce false-positive results while at the same time minimizing the number of protein-coding genes that might be missed. The transcript abundance in all D. diadema tissue samples was estimated on the basis of the trimmed RNA reads by applying the quantification tool Salmon (default settings) and the read mapper Segemehl (alignment accuracy, 98%) [24, 49]. To assess evolutionary processes of putative toxins, a rigorous TPM value of 500 and a 4-fold higher expression level in the venom gland compared to the respective body tissue were chosen to prevent over-interpretation of our data.

Additionally, a second threshold with a lower TPM value (>1) was applied to allow a comparison of the identified venom proteins to previously published robber fly data [12]. Proteins with a housekeeping function, a low expression level in the venom glands, and a high expression level in non-venom gland tissue were not considered as putative toxins and excluded from the analysis.

Venom evolution reconciled by genomics

The ENSEMBL database provides 21 annotated dipteran genomes [21]; 12 of these are from Drosophila species. For Drosophila, only 3 representative genomes were selected for our analyses (Table 1). Otherwise all available taxa were included, with 2 exceptions. The wingless antarctic midge Belgica antarctica was excluded because of its extremely derived lifestyle. Megaselia scalaris was excluded because of the rather experimental approach that was used to sequence its genome [65, 66]. The lepidopterans B. mori and D. plexippus were chosen as outgroup taxa [67, 68]. Apart from ENSEMBL we also mined NCBI for relevant dipteran genomes, and consequently re-annotated and included the genome of P. coquillettii (Supp. Table 3) [16].

The protein sets of all analysed genome species were compared and protein-coding genes assigned to orthogroups with Orthofinder [18]. Depending on the taxon samplings, orthogroups can comprise gene families, gene classes, or only parts of such classification. The aim of the approach is not to identify such hierarchical classes but to infer the homology of the analysed protein sets [18, 19]. Under the assumption that orthogroups only arise 1 time but might be lost several times, the origin of novelties and the expansion of protein groups can be analysed. D. diadema was used as the focal species, which means that only the orthogroups present in this species were analysed further. An orthogroup is considered as present in the LCA of D. diadema and a clade when members of the orthogroup were present in the genome of D. diadema and in ≥1 representative of the analysed clade. Shared orthogroups were counted using the Orthofinder output and a customized Python script.

Use of additional assemblers to assess the top 30 predominant toxins

Venom gland transcriptome datasets of both sexes were additionally assembled using the assemblers rnaSPAdes v.3.13.0 [31] and Trans-ABySS v.2.0.1 [32]. Both assemblers were used with the default settings; on those settings rnaSPAdes uses a k-mer length of 21 and Trans-ABySS a k-mer length of 32. The open reading frames from the initial Trinity assembly and the additionally provided rnaSPAdes and Trans-ABySS assemblies were extracted using Transdecoder v.5.5.0 [69]. Protein sequences of the initial Trinity assembly, which are verified via our proteomic analysis and associated with 1 of the top 30 predominant proteins, were used as a query for a BlastP search in the protein sequences of the rnaSPAdes and Trans-ABySS assemblies. The protein sequence of the best hit was extracted and aligned with the query sequence using mafft-ginsi. The resulting alignment was visualized using Jalview [70].

Availability of supporting data and materials

All transcriptome and genome data are available in NCBI via the Bioproject on robber fly venom evolution, PRJNA361480. Transcriptome raw data of male and female venom gland, body, and proboscis tissue are published with the SRA entries SRR7754486, SRR7754485, SRR5192548, SRR5192547, SRR7754488, SRR7754487. The genome assembly is accessible in GenBank under QYTT00000000; the sequencing raw data are stored in the SRA with the 2 accession numbers SRR7878513 and SRR7878512. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD013358. Other data further supporting this work are available in the GigaScience repository, GigaDB [71].

Additional files

Supplementary Table 1: Overview of available genome species with analysed venom
Supplementary Table 2: Overview of protein families identified in the venom of D. diadema
Supplementary Table 3: NCBI Accession to the included genomes
Supplementary Table 4: SRA-archive accession numbers of used RNA and DNA data
Supplementary Table 5: Overview of RNA sequencing and processing of D. diadema
Supplementary Table 6: Overview of the predominant top 30 venom proteins and their associated orthogroups
Supplementary Table 7: Overview of transposable elements in the predominant top 30 venom proteins
Supplementary Table 8: Overview alternative assembled transcripts - RNASpades

Supplementary Table 9: Overview alternative assembled transcripts - Transabyss
Supplementary Figure 1: Phylogenetic relationship of the analysed species
Supplementary Figure 2: Histogram of k-mer distribution
Supplementary Figure 3: Summarized expression level of putative toxins
Supplementary Figure 4: Comparison of heatmaps based on the RNA-Seq quantification
Supplementary Figure 5: Overview of the analysis workflow
Supplementary File 2: Overview table of sorted orthogroups for all included species
Supplementary File 3: Overview table of the 30 predominant toxins
Supplementary File 4: Alignments of identified protein families

Abbreviations

bp: base pairs; BLAST: Basic Local Alignment Search Tool; BUSCO: Benchmarking Universal Single-Copy Orthologs; CAP: catabolite gene activator protein; FDR: false-discovery rate; ICK: inhibitor cystine knot peptide; kb: kilobase pairs; LCA: last common ancestor; Mb: megabase pairs; MS/MS: tandem mass spectrometry; NCBI: National Center for Biotechnology Information; nt: nucleotides; SRA: Sequence Read Archive; TPM: transcripts per million.

Competing interests

The authors declare that they have no competing interests.

Funding

S.H.D. is funded by a scholarship (Doktorandenförderplatz) from the University of Leipzig. B.M.v.R. was supported for this work by the German Science Foundation (DFG RE3454/4–1). This work was supported by the Australian Research Council (DECRA Fellowship grant No. DE160101142 and Discovery Project grant No. DP160104025 to E.A.B.U.).

Authors’ contributions

S.H.D. and B.M.v.R. conceived the project and designed the analyses. S.H.D. and B.M.v.R. performed specimen collection, dissection, and transcriptomic and genomic analyses. E.A.B.U. conducted the proteomic analyses. L.B. performed all laboratory work for the genome sequencing. S.H.D. and B.M.v.R. wrote the manuscript with input from all authors.

Acknowledgments

B.M.v.R. thanks Fritz Geller-Grimm for helpful discussions and information on species biology and localities. Sabrina Simon and students from the University of Wageningen, and Alessandra Dupont further assisted in finding and collecting specimens. S.H.D. and B.M.v.R. thank in particular Martin Schlegel for his support at the Institute of Biology at the University of Leipzig. B.M.v.R. especially thanks Matthias Meyer at the Max Planck Institute for Evolutionary Anthropology in Leipzig for the fruitful collaboration, also with his team. Computational analyses were partly performed on the High Performance Computing Cluster EVE at the UFZ Leipzig, and S.H.D. and B.M.v.R. thank Christian Krause for his help regarding some analyses set-ups. Beamtime at the Paul Scherrer Institute, Villigen, Switzerland was provided to B.M.v.R. based on the proposal “Evolution of venoms and venom delivery systems of neglected venomous euarthropod and annelid taxa” (ID 20160644). We acknowledge the Paul Scherrer Institute, Villigen, Switzerland for provision of synchrotron radiation beamtime at the TOMCAT beamline X02DA of the SLS and would like to thank Goran Lovric for assistance. B.M.v.R. and S.H.D. conducted this work within the Animal Venomics working group at the Fraunhofer Institute for Molecular Biology and Applied Ecology, Giessen. We acknowledge Alessandra Dupont for commenting and editing of the manuscript and Sébastien Dutertre at the University of Montpellier (France) for initial collaboration in proteomics.

References

1. Nei M, Gu X, Sitnikova T. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc Natl Acad Sci U S A 1997;94:7799–806.
2. Lynch M. The evolutionary fate and consequences of duplicate genes. Science 2002;290:1151–5.
3. Casewell NR, Wüster W, Vonk FJ, et al. Complex cocktails: The evolutionary novelty of venoms. Trends Ecol Evol 2013;28:219–29.
4. Fry BG, Roelants K, Champagne DE, et al. The toxicogenomic multiverse: Convergent recruitment of proteins into animal venoms. Annu Rev Genomics Hum Genet 2009;10:483–511.
5. von Reumont BM. Studying smaller and neglected organisms in modern evolutionary venomics implementing RNASeq (transcriptomics) - A critical guide. Toxins (Basel) 2018;10:E292.
6. von Reumont BM, Campbell L, Jenner R. Quo vadis venomics? A roadmap to neglected venomous invertebrates. Toxins (Basel) 2014;6:3488–551.
7. Vonk FJ, Casewell NR, Henkel CV, et al. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proc Natl Acad Sci U S A 2013;110:20651–6.
8. Cao Z, Yu Y, Wu Y, et al. The genome of Mesobuthus martensii reveals a unique adaptation model of arthropods. Nat Commun 2013;4:1–10.
9. Sanggaard KW, Bechsgaard JS, Fang X, et al. Spider genomes provide insight into composition and evolution of venom and silk. Nat Commun 2014;6(5):3765.
10. Wong ESW, Papenfuss AT, Whittington CM, et al. A limited role for gene duplications in the evolution of platypus venom. Mol Biol Evol 2012;29:167–77.
11. Martinson EO, Mrinalini, Kelkar YD, Chang CH, et al. The evolution of venom by co-option of single-copy genes. Curr Biol 2017;27:2007–2013.e8.
12. Drukewitz SH, Fuhrmann N, Undheim EAB, et al. A dipteran’s novel sucker punch: Evolution of arthropod atypical venom with a neurotoxic component in robber flies (Asilidae, Diptera). Toxins (Basel) 2018;10:E29.
13. Geller-Grimm F. Autökologische Studien an Raubfliegen (Diptera: Asilidae) auf Binnendünen des Oberrheintalgrabens. 1995, Technische Hochschule Darmstadt.
14. Poulton EB. XVI. Predaceous insects and their prey. Trans R Entomol Soc London 1907;54:323–410.
15. Walker AA, Dobson J, Jin J, et al. Buzz kill: Function and proteomic composition of venom from the giant assassin fly Dolopus genitalis (Diptera: Asilidae). Toxins (Basel) 2018;10:456.

16. Dikow RB, Frandsen PB, Turcatel M, et al. Genomic and transcriptomic resources for assassin flies including the complete genome sequence of Proctacanthus coquilletti (Insecta: Diptera: Asilidae) and 16 representative transcriptomes. PeerJ 2017;5:e2951.
17. Undheim EAB, Jones A, Clauser KR, et al. Clawing through evolution: Toxin diversification and convergence in the ancient lineage Chilopoda (centipedes). Mol Biol Evol 2014;31:2124–48.
18. Emms DM, Kelly S. OrthoFinder: Solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 2015;16:157.
19. Paps J, Holland PWH. Reconstruction of the ancestral metazoan genome reveals an increase in genomic novelty. Nat Commun 2018;9(1):1730.
20. Simão FA, Waterhouse RM, Ioannidis P, et al. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015;31:3210–2.
21. Hubbard T, Barker D, Birney E, et al. The Ensembl genome database project. Nucleic Acids Res 2002;30:38–41.
22. Misof B, Liu S, Meusemann K, et al. Phylogenomics resolves the timing and pattern of insect evolution. Science 2014;346:763–7.
23. Otto C, Stadler PF, Hoffmann S. Lacking alignments? The next-generation sequencing mapper segemehl revisited. Bioinformatics 2014;30:1837–43.
24. Patro R, Duggal G, Love MI, et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 2017;14:417–9.
25. Daltry JC, Wüster W, Thorpe RS. Diet and snake venom evolution. Nature 1996;379:537–40.
26. Li M, Fry BG, Kini RM. Eggs-only diet: Its implications for the toxin profile changes and ecology of the marbled sea snake (Aipysurus eydouxii). J Mol Evol 2005;60:81–9.
27. Pekár S, Bočánek O, Michálek O, et al. Venom gland size and venom complexity - essential trophic adaptations of venomous predators: a case study using spiders. Mol Ecol 2018;27:4257–69.
28. Dikow T. A phylogenetic hypothesis for Asilidae based on a total evidence analysis of morphological and DNA sequence data (Insecta: Diptera: Brachycera: Asiloidea). Org Divers Evol 2009;9:165–88.
29. Raychowdhury R, Gnirke A, Fan L, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011;29:644–52.
30. Holding ML, Margres MJ, Mason AJ, et al. Evaluating the performance of de novo assembly methods for venom-gland transcriptomics. Toxins (Basel) 2018;10:249.
31. Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012;19:455–77.
32. Simpson JT, Wong K, Jackman SD, et al. ABySS: a parallel assembler for short read sequence data. Genome Res 2009;19:1117–23.
33. Corzo G, Adachi-Akahane S, Nagao T, et al. Novel peptides from assassin bugs (Hemiptera: Reduviidae): Isolation, chemical and biological characterization. FEBS Lett 2001;499:256–61.
34. Fletcher JI, Smith R, O’Donoghue SI, et al. The structure of a novel insecticidal neurotoxin, ω-atracotoxin-HV1, from the venom of an Australian funnel web spider. Nat Struct Biol 1997;4:559–66.
35. Tripathy A, Meissner G, Resch W, et al. Imperatoxin a induces subconductance states in Ca2+ release channels (ryanodine receptors) of cardiac and skeletal muscle. J Gen Physiol 2002;111:679–90.
36. Wang X hong, Smith R, Fletcher JI, et al. Structure-function studies of ω-atracotoxin, a potent antagonist of insect voltage-gated calcium channels. Eur J Biochem 1999;264:488–94.
37. von Reumont BM, Blanke A, Richter S, et al. The first venomous crustacean revealed by transcriptomics and functional morphology: Remipede venom glands express a unique toxin cocktail dominated by enzymes and a neurotoxin. Mol Biol Evol 2014;31:48–58.
38. Undheim EAB, Mobli M, King GF. Toxin structures as evolutionary tools: Using conserved 3D folds to study the evolution of rapidly evolving peptides. Bioessays 2016;38:539–48.
39. Walker AA, Madio B, Jin J, et al. Melt with this kiss: Paralyzing and liquefying venom of the assassin bug Pristhesancus plagipennis. Mol Cell 2017;16:552–66.
40. Mayhew ML, King GF, Jin J, et al. The assassin bug Pristhesancus plagipennis produces two distinct venoms in separate gland lumens. Nat Commun 2018;9(1):755.
41. Pineda SS, Undheim EAB, Rupasinghe DB, et al. Spider venomics: Implications for drug discovery. Future Med Chem 2014;6:1699–714.
42. Herzig V, King GF. The cystine knot is responsible for the exceptional stability of the insecticidal spider toxin ω-Hexatoxin-Hv1a. Toxins (Basel) 2015;7:4366–80.
43. von Reumont BM, Undheim E, Jauss R-T, et al. Venomics of remipede crustaceans reveals novel peptide diversity and illuminates the venom’s biological role. Toxins (Basel) 2017;9:234.
44. Hargreaves AD, Swain MT, Hegarty MJ, et al. Restriction and recruitment-gene duplication and the origin and evolution of snake venom toxins. Genome Biol Evol 2014;6:2088–95.
45. Yushkevich PA, Piven J, Hazlett HC, et al. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. Neuroimage 2006;31:1116–28.
46. Blender Foundation. Blender. https://www.blender.org/. Accessed on 8 January 2018.
47. Andrews S. FastQC. A quality control tool for high throughput sequence data. 2015. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed on 7 February 2018.
48. Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 2014;30:2114–20.
49. Hoffmann S, Otto C, Kurtz S, et al. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol 2009;5:e1000502.
50. Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010;5:pdb.prot5448.
51. Kircher M, Sawyer S, Meyer M. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Acids Res 2012;40:e3.
52. Zimin AV, Marçais G, Puiu D, et al. The MaSuRCA genome assembler. Bioinformatics 2013;29:2669–77.
53. Laetsch DR, Blaxter ML. BlobTools: Interrogation of genome assemblies. F1000 Res 2017;6:1287.
54. Gurevich A, Saveliev V, Vyahhi N, et al. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 2013;29:1072–5.
55. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011;27:764–70.

56. Holt C, Yandell M. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 2011;12:491.
57. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005;6:31.
58. Korf I. Gene finding in novel genomes. BMC Bioinformatics 2004;5:59.
59. Stanke M, Steinkamp R, Waack S, et al. AUGUSTUS: A web server for gene finding in eukaryotes. Nucleic Acids Res 2004;32:W309–12.
60. Bairoch A. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000;28:45–8.
61. Smit A, Hubley R. RepeatModeler. http://www.repeatmasker.org/RepeatModeler/. Accessed on 6 March 2018.
62. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics 2009;4:Unit 4.10.
63. Bailly-Bechet M, Haudry A, Lerat E. “One code to find them all”: A perl tool to conveniently parse RepeatMasker output files. Mob DNA BioMed Central 2014;5:13.
64. HMMER: Biosequence analysis using profile hidden Markov models. http://hmmer.org/. Accessed on 6 March 2018.
65. Kelley JL, Peyton JT, Fiston-Lavier AS, et al. Compact genome of the Antarctic midge is likely an adaptation to an extreme environment. Nat Commun 2014;5:4611.
66. Rasmussen DA, Noor MAF. What can you do with 0.1x genome coverage? A case study based on a genome survey of the scuttle fly Megaselia scalaris (Phoridae). BMC Genomics 2009;10:382.
67. Zhan S, Merlin C, Boore JL, et al. The monarch butterfly genome yields insights into long-distance migration. Cell 2011;147:1171–85.
68. Xia Q, Zhou Z, Lu C, et al. A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science 2004;306:1937–40.
69. TransDecoder (Find Coding Regions Within Transcripts). https://github.com/TransDecoder. Accessed on 7 January 2018.
70. Waterhouse AM, Procter JB, Martin DMA, et al. Jalview Version 2–A multiple sequence alignment editor and analysis workbench. Bioinformatics 2009;25:1189–91.
71. Drukewitz SH, Bokelmann L, Undheim EAB, et al. Supporting data for “Toxins from scratch? Diverse, multimodal gene origins in predatory robber flies indicate dynamic venom evolution in dipteran insects.” GigaScience Database 2019. http://dx.doi.org/10.5524/100612.