
Bennett et al. Genome Biology 2014, 15:510 http://genomebiology.com/2014/15/11/510

RESEARCH

The genome of the sparganosis tapeworm Spirometra erinaceieuropaei isolated from the biopsy of a migrating brain lesion

Hayley M Bennett1*, Hoi Ping Mok3, Effrossyni Gkrania-Klotsas3, Isheng J Tsai1,8, Eleanor J Stanley1,9, Nagui M Antoun4, Avril Coghlan1, Bhavana Harsha1, Alessandra Traini1, Diogo M Ribeiro1, Sascha Steinbiss1, Sebastian B Lucas7, Kieren SJ Allinson2, Stephen J Price5, Thomas S Santarius5, Andrew J Carmichael3, Peter L Chiodini6, Nancy Holroyd1, Andrew F Dean2† and Matthew Berriman1†

Abstract

Background: Sparganosis is an infection with a larval tapeworm. From a rare cerebral case presented at a clinic in the UK, DNA was recovered from a biopsy sample and used to determine the causative species as Spirometra erinaceieuropaei through sequencing of the cox1 gene. From the same DNA, we have produced a draft genome, the first of its kind for this species, and used it to perform a comparative genomics analysis and to investigate known and potential tapeworm drug targets in this tapeworm.

Results: The 1.26 Gb draft genome of S. erinaceieuropaei is currently the largest reported for any flatworm. Through investigation of β-tubulin genes, we predict that S. erinaceieuropaei larvae are insensitive to the tapeworm drug albendazole. We find that many putative tapeworm drug targets are also present in S. erinaceieuropaei, allowing possible cross application of new drugs. In comparison to other sequenced tapeworm species we observe expansion of protease classes, and of Kunitz-type protease inhibitors. Expanded gene families in this tapeworm also include those that are involved in processes that add post-translational diversity to the protein landscape, intracellular transport, transcriptional regulation and detoxification.

Conclusions: The S. erinaceieuropaei genome begins to give us insight into an order of tapeworms previously uncharacterized at the genome-wide level. From a single clinical case we have begun to sketch a picture of the characteristics of these organisms. Finally, our work represents a significant technological achievement as we present a draft genome sequence of a rare tapeworm, and from a small amount of starting material.

* Correspondence: [email protected]
† Equal contributors
1Wellcome Trust Sanger Institute, Parasite Genomics, Cambridge CB10 1SA, UK
Full list of author information is available at the end of the article

© 2014 Bennett et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Tapeworms affect the lives of millions worldwide. Of those, the debilitating or potentially deadly echinococcosis and cysticercosis are priority targets for the World Health Organization [1]. The availability of genomes of the major disease-causing species Echinococcus spp. and Taenia solium has heralded the way for increased research progress and new avenues for intervention [2,3]. However, molecular knowledge regarding rarer tapeworm infections, such as those with Spirometra erinaceieuropaei, is scarce.

Compared with more common human-infective tapeworms, S. erinaceieuropaei has an even more complex life cycle (Figure 1) involving a minimum of three hosts for completion. Spirometra spp. are found worldwide but human infections are most often reported in Asian countries, typically China, South Korea, Japan and Thailand, although several recent travel and migration-related cases of sparganosis have occurred in Europe [4,5]. The infective stage for humans is a motile, secondary larval form known as the sparganum. Infection can occur through the ingestion of raw tadpoles, the consumption of undercooked frogs or snakes, or the use of frog meat as a poultice on open wounds or eyes [6]. However, infections are also thought to arise through accidental ingestion of infected copepods from contaminated drinking water or from swallowing water whilst swimming [6,7].

Figure 1 Life cycle of Spirometra erinaceieuropaei. (A) Unembryonated eggs are released and embryonate over 8 to 14 days in water [10]. (B,C) Eggs hatch to release free-swimming coracidia (B), which parasitize copepods (such as Cyclops sp.) and develop into procercoid larvae (C). (D) On ingestion of the copepod by a vertebrate - such as a tadpole, frog or snake - these develop into plerocercoid larvae, also known as spargana. The plerocercoid larvae reside in the tissues of these organisms. The larval stage infection can be passed on when the host organism is eaten. (E) Humans become infected by ingestion of a live larva, or in some cases direct contact, such as a poultice of infected frog tissue on the eye. A larva can also infect humans when an infected copepod is ingested. (F) The larva only develops into the adult form in the gastrointestinal tract once it reaches a definitive host, such as a cat or a dog, where eggs are passed in the faeces (A). Curly brackets denote known hosts, though the full extent of the possible hosts and life cycle complexity of this tapeworm species have not been well characterized. Images of S. erinaceieuropaei are guided by the experimental life history photographed by Lee et al. [10]. Source of modified images: snake [11]; frog courtesy of Anant Patel MD; cyclops [12] (Matt Wilson/Jay Clark, NOAA NMFS AFSC); dog [13] (Richard New Forest).

Once the larva is inside the human body, its final location appears unrestricted - reported sites of infection include the eyes, subcutaneous tissue, abdominal cavity, spinal cord and brain [6,8]. Pathology is associated with location; for example, infections in the brain can cause convulsions or paralysis. The worm is usually only discovered during exploratory surgery and treated by its subsequent removal [4,9].

Infections with S. erinaceieuropaei and closely related tapeworms are rare in humans. Pampiglione et al. [7] collated 300 cases worldwide between 1953 and 2003. A review of Chinese language articles revealed more cases, over 1,000 in Mainland China since 1882 [6]. Because these infections occur rarely, clinicians are not likely to consider this diagnosis until many other tests have been performed, and usually the worm is only discovered during surgery. Infections are even more unexpected in Europe, as there were only seven reported cases in the literature before 2003 [7]. Recent cases of travel- or migration-related infection in Europe have occurred in the last three years [4,5].

In this study we describe genome sequencing of a single parasite isolated from a 50-year-old male patient who presented in the East of England with a debilitating larval tapeworm infection that showed migration across the brain over a 4-year period. By PCR on DNA extracted from a biopsy sample, we identified the worm as S. erinaceieuropaei, distinguishing it from S. proliferum, a taxonomically related species known for its ability to proliferate (with potentially fatal consequences) in the human host. From a histological section, we isolated the parasite and produced a draft genome sequence. We examined the known targets of drugs in the parasite genome and used this to predict how this parasite would have responded to chemotherapy-based treatments. From a large-scale comparison of gene families across the tapeworms, we identified gene family expansions in this cestode, which is the first of its order (Diphyllobothriidea) whose genome has been sequenced. These data contribute to the growing global database for identifying parasites and parasite provenance and will serve as a resource for identifying new treatments for sparganosis.

Results

Migrating cerebral lesions indicate sparganosis

A 50-year-old man of Chinese ethnicity was admitted for investigation of symptoms that included headaches, complex partial and tonic-clonic seizures, reported episodes of altered smell and flashback of memory and memory impairment as well as progressive right-sided pain. The patient had lived in the UK for 20 years but visited his
homeland often. MRI of the brain revealed an abnormality in the right medial temporal lobe of high signal on T2 (oedema) with a cluster of ring-enhancing lesions (Additional file 1). The diagnostic possibilities were of an inflammatory or a neoplastic lesion.

The patient tested negative for HIV, tuberculosis, Lyme disease, syphilis, coccidioides, histoplasma and cryptococcus. A cysticercus immunoblot with patient serum was negative. Inflammatory screens for antinuclear and anti-neutrophil antibodies and complement (C3 and C4) were normal and the patient was systemically well. C-reactive protein (CRP) level was within the normal range (3 mg/L), as was the erythrocyte sedimentation rate (6 mm/h). Computed tomography of his chest, abdomen and pelvis showed no abnormality.

Right temporal lobe neurosurgical biopsy showed a mixed lymphocytic (B and T cells) non-necrotising, non-granulomatous inflammation with a few plasma cells. Tuberculosis was suspected but no organisms were visualised.

A series of MRI images in the ensuing four years demonstrated gradual contralateral migration of the multiloculate lesions from the right hemisphere through the thalamus (Figure 2). Throughout the disease process, the lesion had moved at least 5 cm through the brain. A second biopsy, from the left thalamus, showed granulomatous inflammation, focal necrosis and an approximately 1 cm ribbon-shaped cestode larval worm without mouthparts or hooklets. With the pathognomonic morphology of a sparganum, it was so diagnosed at the Department of Histopathology, St Thomas' Hospital, and the Department of Clinical Parasitology, Hospital for Tropical Diseases (Figure 3). Immediately post-operation, the patient was given albendazole and is now systemically well.

Figure 2 Sequential imaging over a 4-year period identifies migrating lesions. Sequential imaging from July 2008 to June 2012. All images are coronal T1 scans post gadolinium. The shifting white arrow, from right to left hemispheres, depicts the migration pattern of a cluster of ring-enhancing lesions.

Figure 3 Morphological examination of biopsy reveals infection is sparganosis. (A) A 1.6-fold magnified view of the worm and adjacent brain tissue from biopsy; the worm is unsegmented (although there are infoldings of the cuticle), without intestine, and uniform in internal structure. (B) A host granulomatous reaction featuring focal necrosis, epithelioid and multi-nucleated giant cells of macrophage-derivation, some plasma cells and lymphocytes but no eosinophils that, considered in isolation, resembles tuberculosis (×20). (C) A 20-fold magnified view of the worm demonstrates the eosinophilic syncytial tegument, sub-tegumental nuclear layer, and the internal watery stroma that includes thin muscle fibres, round cells, and 'empty' tubular excretory ducts. (D) A 40-fold magnified view of the internal stroma exhibits thin eosinophilic muscle fibres and stromal cells with pale haematoxyphilic cytoplasm. All images stained with haematoxylin and eosin and scale bars are 5 mm (A), 0.5 mm (B,C) and 0.25 mm (D).

Molecular identification of the causative agent as S. erinaceieuropaei

DNA was extracted from the formalin-fixed paraffin embedded worm, and PCR and Sanger capillary sequencing were carried out using primers for cytochrome c oxidase 1 (cox1), the mitochondrial gene often referred to as 'the barcode of life'. A consensus sequence from forward and reverse reads was used to search against the EMBL database using BLASTN, and returned cox1 from S. erinaceieuropaei as a top hit, notably higher than the search result against the proliferative S. proliferum, which is morphologically similar but would have a poor prognosis for the patient. Alignment of the sequences confirmed this finding (Figure 4). The sequence shared 98% identity with S. erinaceieuropaei compared with 90% identity with S. proliferum.

No exact cox1 match was found in S. erinaceieuropaei isolates that had previously been sequenced. However, the base anomalies to previously known S. erinaceieuropaei cox1 sequence were subsequently confirmed in whole genome data (Additional file 2).

Interestingly, consensus sequences from two further mitochondrial genes, nad1 and cox3, were identical to S. erinaceieuropaei sequences from isolates collected from frogs in Hunan province, China [14].

The genome of S. erinaceieuropaei

Using 0.048 μg DNA isolated from a formalin-fixed biopsy, a 1.26 Gb draft assembly of the S. erinaceieuropaei genome was assembled from two lanes of paired-end Illumina HiSeq 2000. Protein-coding genes were predicted using the software MAKER [15], which used the gene prediction software Augustus [16], GeneMark [17] and SNAP [5] alongside species-specific gene models from Caenorhabditis elegans and Cestodes as evidence. Genome statistics are presented in Table 1 and genome quality assessment in the Materials and methods section.

To assess the completeness of the genome, we used the Core Eukaryotic Genes Mapping Approach (CEGMA) software [14], which includes hidden Markov models for 458 core eukaryotic genes. A subset of these, 248 genes, are extremely highly conserved and are believed to be present in virtually all eukaryotes as single copy genes. The proportion of this subset that can be mapped into a target genome provides an assessment of the completeness of the genome. The standard CEGMA pipeline identified 73 of the 248 core CEGMA genes (29.44%) in the assembly as complete, with an additional 115 core CEGMA genes reported as partially contained (46.7%). The average number of predictions for each complete gene was 1.42 (1.81 for partial genes), indicating some level of expansion of the assembly due to its draft nature. Analysing the raw BLAST output file produced by CEGMA revealed that 93.1% of all 458 CEGMA genes had significant BLAST matches with e-values of <1e-05 (88.2% in predicted gene models). The fragmented nature of the assembly had therefore prevented many genes from meeting the more stringent matching criteria set by CEGMA. The BLAST results suggest that most of the core genes are identifiable in the genome but that many genes are present as fragments within the assembly.

Using RepeatModeler [18] and RepeatMasker [19], 43% (537 Mb) of the S. erinaceieuropaei genome was masked as repetitive, including 16% long interspersed elements (LINEs), 4% short interspersed elements (SINEs), 2% long terminal repeat (LTR) elements and 19% unclassified repetitive elements.

We interrogated the S. erinaceieuropaei genome with a recently published EST data set [20] and found that all 5,641 ESTs had a significant BLAST match with e-values of <1e-05, indicating that the genome contains useful molecular data. Additionally, we found that 73% of ESTs were within predicted gene models.
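The e-value filtering applied to the EST and CEGMA BLAST results above can be illustrated with a short sketch. It is not the authors' pipeline: it assumes the BLAST results were exported in the standard 12-column tabular format (query ID in column 1, e-value in column 11), and the function name, row contents and identifiers below are hypothetical.

```python
from typing import Iterable, Set


def queries_with_significant_hit(blast_rows: Iterable[str],
                                 max_evalue: float = 1e-5) -> Set[str]:
    """Return the query IDs that have at least one hit with e-value < max_evalue.

    Assumes tab-separated BLAST tabular rows with the default 12 columns,
    where field 1 is the query ID and field 11 is the e-value.
    """
    significant = set()
    for row in blast_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            significant.add(query_id)
    return significant


# Hypothetical rows for illustration only (not data from the study).
example_rows = [
    "EST_1\tscaffold_12\t98.5\t200\t3\t0\t1\t200\t501\t700\t1e-90\t350",
    "EST_2\tscaffold_40\t81.0\t60\t11\t0\t5\t64\t10\t69\t0.002\t40.1",
    "EST_2\tscaffold_77\t95.0\t120\t6\t0\t1\t120\t33\t152\t3e-40\t160",
    "EST_3\tscaffold_05\t75.0\t40\t10\t0\t1\t40\t1\t40\t0.5\t25.0",
]
hits = queries_with_significant_hit(example_rows)
total_queries = 3  # number of distinct hypothetical ESTs searched
fraction = len(hits) / total_queries
```

Counting distinct query IDs rather than rows matters here, because a fragmented assembly can yield several hits for the same EST.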

Figure 4 Alignment of cox1 amplicon with cox1 sequence from S. erinaceieuropaei and S. proliferum. Consensus sequence from forward and reverse capillary reads of cox1 amplicon (line name = amplicon) aligned against the two species S. erinaceieuropaei (line name = Spirometra) and S. proliferum (line name = Sparganum). Bases highlighted in red differ from the amplicon; asterisks indicate consensus between all sequences.
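The identity figures quoted for the cox1 comparison can be understood through a minimal sketch of percent identity over a pairwise alignment. This is an illustration under stated assumptions, not the method used in the study: the sequences are toy strings, the function name is hypothetical, and columns with a gap in either sequence are skipped, which is only one of several conventions for handling gaps.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap are ignored (an assumption;
    other tools may count gapped columns as mismatches).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy sequences for illustration only (not the cox1 data).
amplicon = "ATGCTTGGAC"
close_reference = "ATGCTTGGAT"    # one mismatch in ten columns
distant_reference = "ATGATT-GAC"  # one mismatch, one gapped column

identity_close = percent_identity(amplicon, close_reference)
identity_distant = percent_identity(amplicon, distant_reference)
```

Ranking candidate species by such a score mirrors the comparison above, where the amplicon was closer to S. erinaceieuropaei than to S. proliferum.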

The characteristics of the current tapeworm chemotherapy targets in S. erinaceieuropaei

We focused our initial interrogation of the genome on features with the highest potential clinical relevance, such as targets of tapeworm chemotherapy. β-Tubulin is a microtubule component targeted by the benzimidazole class of drugs, such as albendazole, a commonly used drug for tapeworm infection. In the roundworm Haemonchus contortus, well-characterized mutations, namely phenylalanine to tyrosine at codons 167 and 200, are known to confer resistance to benzimidazoles in both laboratory and field studies [21,22]. Searching for β-tubulin genes by TBLASTX, using known Echinococcus multilocularis sequences, revealed potential homologs in the S. erinaceieuropaei genome. We aligned protein sequences with the region of interest, and found that one had tyrosine residues in the positions known to confer benzimidazole resistance (SPER_0000685601). A reciprocal BLAST search confirmed that the latter gene is a likely orthologue of tub-2, highly expressed in E. multilocularis larva. We also searched for β-tubulin transcripts by BLAST in recently published EST data from the larval stage of S. erinaceieuropaei [20], and found that of 26 β-tubulin ESTs, 24 contained the benzimidazole resistance-associated amino acids.

The drug praziquantel is also used to treat tapeworm infections [23]. Schistosomes, which are from another major clade of parasitic flatworms, are also sensitive to praziquantel and the calcium channel subunit CaV2 B has been postulated as the drug's target [24]. In the case of schistosomes, the accessory β2a calcium channel subunit lacks two serine residues (likely phosphorylation sites for protein kinase C) that are conserved in vertebrate orthologues. When these residues are removed from rat β2a subunits by mutagenesis, reconstituted calcium channels become sensitive to praziquantel in vitro [25]. Although there is still uncertainty about the exact target(s) of praziquantel, CaV2 B is the current best lead; we therefore examined the sequence characteristics of CaV2 B in S. erinaceieuropaei. To identify candidates, we searched using the sequences of genes encoding calcium channels from the E. multilocularis genome. The latter genes are long with many exons and long stretches of intronic sequence. Therefore, considering the fragmented nature of the S. erinaceieuropaei assembly, obtaining primarily partial BLAST matches from our gene transcripts for these genes was to be expected. Two out of four partial hits (SPER_0001175301 and SPER_0001441801) had an aligned region covering the phosphorylation residues identified as potential drug response modulators (225; 235 in rat β2a), and these contained a 'sensitive' asparagine and an alanine in the equivalent positions. The other two hits were shorter and encoded a threonine and a serine in these positions.

ATP-binding cassette (ABC) transporter proteins are efflux pumps that have relevance to multidrug resistance in nematodes and schistosomes [26]. A total of 19 six-transmembrane helix ABC transporter domains (InterPro: IPR001140, Pfam: PF00664) were detected in E. multilocularis predicted gene transcripts, whereas a total of 37 of these domains were present in S. erinaceieuropaei predicted transcripts.

Table 1 Genome-wide statistics for the S. erinaceieuropaei assembly and gene predictions

Genome statistics
Size of genome (Gb)                            1.26
GC content (%)                                 46
Number of scaffolds                            483,112
N50                                            4,647
Largest scaffold (bp)                          89,810
Number of predicted genes                      39,856
Gene density per Mb                            31.6
Length of proteome (amino acids)               6,844,057
Maximum protein length (amino acids)           1,947
Average protein length (amino acids)           172
Average exon length (bp)                       259
Median exon length (bp)                        213
Average exons per transcript                   2
Median exons per transcript                    2
Total length of contained introns (Mb)         41
Average intron length (bp)                     1,065
Median intron length (bp)                      418
CEGMA                                          73/248
BLAST output against CEGMA                     231/248

New tapeworm drug targets in S. erinaceieuropaei

Our next approach concentrated on finding orthologues of putative tapeworm drug targets proposed during analysis of the E. multilocularis genome [2], many of which are targets of known cancer drugs, thus opening the door to a possible drug repurposing strategy for identifying new leads for development. Predicted gene transcripts from the assembled S. erinaceieuropaei genome were searched using TBLASTX for evidence of homologs of these targets (Table 2). There were significant hits for each putative target. Genes notable for both their high identity and completeness when aligned to the E. multilocularis sequences were adenine nucleotide translocator (SPER_0000599901), ribonucleoside diphosphate reductase (SPER_0000698501), calmodulin (SPER_0000219201), FK506 binding protein (SPER_0000627901) and elongation factor 2 (SPER_0001150701).

Table 2 Putative tapeworm drug targets for which there is a TBLASTX hit in predicted S. erinaceieuropaei gene transcripts (E-value <1e-10)

Putative E. multilocularis                      TBLASTX   Percentage        Percentage identity  S. erinaceieuropaei
drug targets(a)                                 E-value   completeness(b)   of match             gene ID
Thioredoxin glutathione reductase (TGR)         8e-77     28                73                   SPER_0002850001
Fatty acid amide hydrolase                      3e-59     39                50                   SPER_0002366401
Adenine nucleotide translocator                 e-161     79                84                   SPER_0000599901
Inosine 5′ monophosphate dehydrogenase          2e-81     42                79                   SPER_0001958401
Succinate semialdehyde dehydrogenase            1e-54     25                68                   SPER_0002970001
Ribonucleoside diphosphate reductase            e-149     67                83                   SPER_0000698501
Casein kinase II                                e-114     47                93                   SPER_0000626801
Hypoxanthine guanine phosphoribosyltransferase  2e-35     57                43                   SPER_0000257601
Glycogen synthase kinase 3                      2e-69     50                84                   SPER_0001364001
Proteasome subunit                              4e-21     36                59                   SPER_0002270001
Calmodulin                                      1e-94     100               98                   SPER_0000219201
FK506 binding protein                           3e-43     100               70                   SPER_0000627901
UMP-CMP kinase                                  6e-24     40                73                   SPER_0000808401
Na+/K+ ATPase                                   0         40                91                   SPER_0000981501
Carbonic anhydrase II                           3e-39     84                60                   SPER_0002854501
NADH dehydrogenase subunit 1                    3e-22     55                62                   SPER_0000882501
Translocator protein                            1e-26     95                44                   SPER_0002949701
Elongation factor 2                             0         81                70                   SPER_0001150701
Cathepsin B (cysteine protease)                 6e-93     62                61                   SPER_0002586301
Dual-specificity mitogen activated protein      3e-54     27                86                   SPER_0000571801
Purine nucleoside phosphorylase                 3e-77     69                61                   SPER_0000360401

(a) Expressed in E. multilocularis larvae. (b) Percentage of E. multilocularis genes covered by alignment with S. erinaceieuropaei sequence.

Genes predicted to be involved in host-parasite interactions

We identified the gene encoding plerocercoid growth factor (PGF), also known as S. erinaceieuropaei cysteine protease (SeCP; SPER_002801201), thought to have a role in multiple aspects of host-parasite interaction [27,28]. PGF has previously been identified as the component of Spirometra species secretory products that binds to human growth factor receptors, stimulating growth [27]. It has been shown to coat the plerocercoid larval tegument of Spirometra mansonoides and has cysteine protease activity against collagen, perhaps enabling the parasite to digest host tissue during migration [29]. Reported PGF cleavage activity against immunoglobulin may also enable the parasite to moderate inflammation [30].

Proteases and protease inhibitors are well known for their importance in host-parasite relationships [31-33]. Using InterProScan 5 we identified 302 sequences that contained predicted proteases or protease inhibitor domains. Using the MEROPS databases of proteases and protease inhibitors [34], we classified 242 of these genes and found the most abundant to be inhibitors of serine proteases (Figure 5). Interestingly, several classes of proteases appeared to be considerably expanded in comparison to Echinococcus spp.: the M17 class (amino-terminal leucyl aminopeptidases) and the serine endopeptidase classes S1A (chymotrypsin A-like) and S28 (lysosomal Pro-Xaa carboxypeptidase-like).

There is also an expanded family of nine M17 proteases in Drosophila, found to be highly expressed in sperm, though their exact functional role is unknown [35]. In the MEROPS resource Drosophila persimilis has the most abundant representation of the M17 family with 16 paralogues. In S. erinaceieuropaei we identified 28 putative M17 family proteases, 21 of which have clearly indicated active sites identified in the MEROPS analysis. Kunitz-type protease inhibitors (class I02) were notable for their abundance in all tapeworm species, and twice as many were detected in S. erinaceieuropaei.

Figure 5 Cross-species comparison of protease and protease inhibitor classes. Protease and protease inhibitors by MEROPS classification in E. granulosus (green), E. multilocularis (orange) and S. erinaceieuropaei (purple) ordered alphabetically. In all species there is a large number of I02 class members, representing Kunitz-type protease inhibitors. The M17 class consists of leucyl aminopeptidases and the S01A and S28 classes are serine endopeptidases.

Fatty acid transporters that bind low density lipoprotein (CD36 class B scavenger receptors) have been identified in other tapeworm genomes [2]. A TBLASTX search of the S. erinaceieuropaei transcripts using the E. multilocularis
CD36 class B scavenger receptor (SCARB) sequences returned 14 hits. These transcripts gave reciprocal BLAST hits in the E. multilocularis genome, closest to the SCARB1.2, SCARB1.3 and SCARB2 genes. Thus, it appears that Spirometra, similar to other tapeworms, scavenges lipids from its host.

Comparison of gene families in S. erinaceieuropaei with other characterized tapeworms

Previously, no tapeworm of this order (Diphyllobothriidea), which also includes the Diphyllobothrium species responsible for diphyllobothriasis in humans, has been subject to whole genome sequencing. Therefore, this genome represents the first opportunity to investigate the genetic differences to the better characterized Cyclophyllidea tapeworms (for example, Taenia spp. and Echinococcus spp.).

To identify genes that have duplicated or been lost in S. erinaceieuropaei we used the EnsemblCompara GeneTrees pipeline to identify gene families across the following tapeworm genomes: E. multilocularis, Echinococcus granulosus, T. solium and Hymenolepis microstoma. Genomes from the trematodes Schistosoma mansoni and Clonorchis sinensis were also included in the analysis, along with outgroup genomes from Capitella teleta (a marine polychaete worm) and Crassostrea gigas (Pacific oyster). For details of each tree see Additional file 3. A genome-wide phylogeny based on genes shared between all seven species fitted expected phylogenetic relationships (Figure 6).

Figure 6 Phylogeny of cestodes demonstrating the relationship of S. erinaceieuropaei to the other species. Phylogenetic tree of all platyhelminth EnsemblCompara GeneTree species outrooted by Capitella teleta and Crassostrea gigas. All orthologues of gene families (protein fasta files) from Compara were filtered to include representatives from at least seven species, and these were aligned with the multiple alignment program for amino acid or nucleotide sequences (MAFFT). Poor alignments were filtered out using GBlocks and the remaining alignments were concatenated to PHYLIP multiple alignment format for passing to raxmlHPC along with the partition model. raxmlHPC was run with random seed 2131. Scale bar represents length of horizontal branch corresponding to a rate of genetic change per base of 0.2.

Given the fragmentary nature of the S. erinaceieuropaei genome, there was potential for the apparent number of predicted genes per family to be inflated by fragments from the same gene appearing more than once in the same family. There was indeed some indication that this was the case when gene families were ranked by the ratio of the number of S. erinaceieuropaei to E. multilocularis genes (Additional file 4); the highest apparently expanded protein family was titin, the largest known natural protein, and therefore a potential source for a huge number of alignable fragments. Unc-22 (twitchin), a giant intracellular protein, was also apparent at the top of the list. The distribution of the median length of predicted proteins encoded by each gene family indicated that the S. erinaceieuropaei gene predictions were short compared with
the other cestode species (Additional file 5). A plot of E. erinaceieuropaei classed with the STARP-like antigen fam- multilocularis median protein lengths against the number ily, although they were few and noticeably absent from the of S. erinaceieuropaei proteins in the same family con- predominant branch of this tree (GeneTree ID: 8926). firmed this trend (Additional file 6). These findings, across four antigen families, suggest it is To get a more accurate estimation of gene family ex- quite likely that S. erinaceieuropaei,andperhapsthe pansions, potentially representing specialization or adap- Diphyllobothriidea, do not, in general, share the same anti- tation within the Spirometra lineage, we ranked gene gen family expansions as the Cyclophyllidea tapeworms. families by the ratio of the total cumulative length of The most expanded gene family encoded one group of encoded S. erinaceieuropaei proteins to the cumulative dynein molecular motors. When we examined families length of the corresponding E. multilocularis proteins. A inclusive of the other 15 E. multilocularis heavy chain ratio cutoff of 3 was used to define the most expanded dyneins annotated on GeneDB we found that the dynein families, and to avoid apparent duplications that could motors in general were not expanded to the same degree be caused by divergent haplotypes within the assembly. (total length for E. multilocularis = 14,969, total length There were 83 gene families that matched these criteria for S. erinaceieuropaei = 17,067, ratio of S. erinaceieuro- and the putative function of each family was investigated paei to E. multilocularis = 1.14), indicating that this sub- (Additional file 7). The M17 protease class identified in set may have specific importance to S. erinaceieuropaei. our previous MEROPS analysis was confirmed by our One of the top gene families (rank 5), consisting of a expansion criteria (ranked 21). 
number of paralogs of FUT8, closest in sequence to alpha We investigated the total protein length of gene families (1, 6) fucosyltransferases, was highly expanded in S. erina- that had previously been described as expanded in tape- ceieuropaei. These enzymes have been shown to provide worm species (Table 3) [2]. Expansion of tetraspanin is core fucosylation at N-glycans [36]. Glycosyltransferases, not apparent in S. erinaceieuropaei, demonstrating that which add core 2 O-glycan branches (rank 76) and galacto- there are differences between the evolutionary history of syltransferase proteins (rank 8) were also expanded in S. these proteins between Diphyllobothriidea and Cyclophyl- erinaceieuropaei. These enzymes may create greater com- lidea tapeworm orders. Based on the GeneTree topologies, plexity at the protein structure level of glycoproteins in S. fatty acid binding proteins (GeneTree IDs: 13715, 104992, erinaceieuropaei. A number of other gene families involved 16199, 33149, 40763, 5377), appear to have expanded in post-translational modification of proteins came up as independently in H. microstoma and S. erinaceieuropaei. expanded: several kinases, primarily serine/threonine kin- In the case of the galactosyltransferases, a considerable ase families and some proteins involved in protein folding expansion is apparent in S. erinaceieuropaei within one (Kelch protein 18 and peptidylprolyl cis-trans isomerase 3). particular branch (GeneTree ID: 1090). We categorized each family into one of ten top level A number of previously described antigen families were functions to further aid visual interpretation of the data: also apparently absent from S. erinaceieuropaei -EG95, structural/cellular transport, regulation of transcription, Antigen B and GP50. There were proteins from S. post-translation modification or processing, transporter,

Table 3 Total protein length of gene families described as expanded in other tapeworm species

| Family | E. granulosus | E. multilocularis | T. solium | H. microstoma | C. sinensis | S. mansoni | S. erinaceieuropaei | Expansion |
|---|---|---|---|---|---|---|---|---|
| Heat shock protein, HSP70 | 22,013 | 23,205 | 17,547 | 2,884 | 7,460 | 4,100 | 4,441 | No |
| Tetraspanin | 8,547 | 9,268 | 8,253 | 6,514 | 2,442 | 3,683 | 3,742 | No |
| Glycosylphosphatidylinositol (GPI)-anchored protein, GP50 | 2,326 | 8,441 | 4,604 | 5,347 | 0 | 0 | 0 | No |
| Antigen B | 590 | 780 | 329 | 428 | 0 | 0 | 0 | No |
| EG95 | 306 | 626 | 1,350 | 0 | 0 | 0 | 0 | No |
| Sporozoite threonine and asparagine-rich protein (STARP)-like antigen | 1,519 | 2,072 | 2,517 | 772 | 1,661 | 1,783 | 388 | No |
| Galactosyltransferase | 6,685 | 8,690 | 5,689 | 2,937 | 1,264 | 2,157 | 10,296 | Yes |
| Mu class glutathione S transferase | 1,142 | 1,919 | 2,044 | 2,044 | 437 | 413 | 1,371 | Yes |
| Fatty acid binding protein 2 | 799 | 586 | 670 | 1,116 | 125 | 231 | 1,016 | Yes |

Protein family trees containing genes belonging to the tapeworm gene expansions described by Tsai et al. [2] were extracted from the EnsemblCompara GeneTrees database. The cumulative number of codons is described here for each species. The last column notes whether this family is likely to be expanded in all tapeworms.
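The ratio-based ranking described in the Results (cumulative S. erinaceieuropaei protein length divided by cumulative E. multilocularis protein length, with families at a ratio of 3 and above called expanded) can be illustrated with a minimal Python sketch. This is not the authors' script: the function name, the dictionary layout and the `toy_family` entry are hypothetical, and skipping families with no E. multilocularis protein is an assumption. The dynein and galactosyltransferase totals are the values quoted in the text and in Table 3.

```python
from typing import Dict, List, Tuple


def rank_expanded_families(
    query_lengths: Dict[str, int],
    reference_lengths: Dict[str, int],
    cutoff: float = 3.0,
) -> List[Tuple[str, float]]:
    """Return (family, ratio) pairs with ratio >= cutoff, largest ratio first.

    query_lengths: cumulative protein length per family in S. erinaceieuropaei.
    reference_lengths: cumulative protein length per family in E. multilocularis.
    """
    ranked = []
    for family, query_total in query_lengths.items():
        reference_total = reference_lengths.get(family, 0)
        if reference_total == 0:
            # Ratio is undefined without a reference protein (assumption: skip).
            continue
        ratio = query_total / reference_total
        if ratio >= cutoff:
            ranked.append((family, ratio))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Totals from the text (all dyneins) and Table 3 (galactosyltransferase);
# "toy_family" is an invented example that passes the cutoff.
query = {"dynein_all": 17067, "galactosyltransferase": 10296, "toy_family": 3000}
reference = {"dynein_all": 14969, "galactosyltransferase": 8690, "toy_family": 900}

expanded = rank_expanded_families(query, reference)
dynein_ratio = round(query["dynein_all"] / reference["dynein_all"], 2)
```

With these inputs only the invented family passes the cutoff, and `dynein_ratio` reproduces the 1.14 reported above for the dynein motors as a whole.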

receptor/signal transduction, protease, mRNA processing, metabolic processing/detoxification, cell cycle or DNA repair and unknown (Table 4). A large number of expansions contained proteins of unknown function. A BLASTX search of the S. erinaceieuropaei genes against the UniProt database [37] returned uncharacterised proteins with the following exceptions. All S. erinaceieuropaei genes within GeneTree 40097 returned hits to putative AMP-dependent ligases in S. mansoni (2 to 7, 9 and 11), known for their action in processing fatty acids. Genes within GeneTree 40961 returned hits to human Flt3-interacting zinc finger proteins (which interact with the receptor tyrosine kinase Flt3) and genes within GeneTree 66872 gave hits to S. mansoni putative rac guanyl-nucleotide exchange factor.

Almost half of all gene families in our comparative analyses were unique to S. erinaceieuropaei (14,530 out of 22,026) - this large number may reflect clustering of partial components of genes. We took the 20 largest (in total protein length) of these unique gene families and investigated whether we could identify related proteins by BLASTX against the UniProt database [37]. The genes within these families did not return any significant hits to annotated proteins.

Table 4 Summary of categorized gene family expansions

| Symbol | Top-level category | Number of expansions within category |
|---|---|---|
| ■ | Unknown | 20 |
| ♦ | Regulation of transcription | 14 |
| ❖ | Structural/cellular transport | 11 |
| ★ | Post-translational modification or processing | 9 |
| ✪ | Metabolic processing/detoxification | 9 |
| Δ | Transporter | 6 |
| ✚ | mRNA processing | 5 |
| ↲ | Receptor/signal transduction | 3 |
| ○ | Cell cycle or DNA repair | 3 |
| ◉ | Protease | 2 |

Symbols cross-reference with those in Additional file 7.

Discussion
In this study, we report the third case of sparganosis in Europe, a cerebral infection with S. erinaceieuropaei in East Anglia, UK. After an initial biopsy failed to reveal the presence of the worm, and not knowing the cause of the lesion, we observed the migration pattern of the worm develop over four years, including its passage over to the opposite hemisphere of the brain. Using DNA extracted from the worm, the morphological diagnosis was refined to the species level, and the remainder of the sample was used to sequence and assemble the genome de novo. We investigated known and potential drug targets in the genome and all of the genome data are publicly available.

This case demonstrates the long-lived and active nature of a sparganosis larva in a human host, and how early diagnosis and recognition of this pattern would benefit future patients, minimizing tissue damage over critical regions of brain. The patient in this case suffered from a variety of neurological symptoms that changed in nature over the course of the infection. It is possible that some of these could have been prevented if the infection was recognized at an earlier stage. The case reported here occurred before publication of a study by Gong et al. [38] that focused on the MRI characteristics of 18 children diagnosed with cerebral sparganosis. In the eight children that had MRI scan data over time, migration of lesions was observed in three. Gong et al. also reported on the different MRI enhancement patterns observed, which included ring-enhancing lesions similar to those observed in this patient, half of which were characterized as beaded or nodular. Here we also observe the presence of multiloculate lesions. Therefore, in future cases, when other more common potential causes (such as tuberculosis) are ruled out, a migration pattern with ring-enhancing lesions, particularly multiloculate, should raise suspicion of sparganosis.

Sparganosis is a general term for infection with a subclass of tapeworms, as the different species that can be responsible are not distinguishable by eye. However, the exact species of worm can affect the prognosis for the patient. S. erinaceieuropaei is the more common causative agent. S. proliferum is the most mysterious of the sparganosis-causing worms, as its adult form has never been observed. The defining characteristic of S. proliferum is its ability to proliferate in the host, and it has also been defined as a separate species at the molecular level [39]. It is exceptionally rare but has been observed in a number of cases that have proved fatal. Determining the species of worm as S. erinaceieuropaei in this infection, based on its mitochondrial cytochrome oxidase 1 sequence, was therefore positive news for the patient in this case.

Identifying the species at the molecular level also gives us a clue as to the origin of infection. S. mansonoides is reported as the Spirometra species found in the Americas, whilst S. erinaceieuropaei is the species more commonly reported in East Asia. A population genetics study of S. erinaceieuropaei was previously conducted in Hunan province, China utilising two other mitochondrial genes, nad1 and cox3 [9]. In order to investigate the geographical origin we also sequenced these markers and found that both were identical to some of the haplotypes found in the previous study. The fact that in just one provincial population polymorphism is seen in these genes, and that we found sequences that were identical to some of these, suggests that the infection originated in China. This is consistent with the patient’s travel history.

With an increase in global mobility, infections such as sparganosis that have previously been constricted to a

certain region may increasingly appear in places with no prior history [40]. Recording such events and sharing molecular data will be critical for a greater understanding of the epidemiology of infections and to help clinicians understand the potential diagnoses in their geographical area.

Previously there has been a paucity of molecular data for S. erinaceieuropaei; reports in the literature have focused on the mitochondrion [41]; a small number of cloned nuclear genes, such as genes encoding copper/zinc-superoxide dismutase [42] and a ced-3-like apoptosis-related gene [43]; and a survey of 910 ESTs [44]. Recently, the genomes of four different species of tapeworm were described [2,3] but, for the first time, a genome from the Diphyllobothriidea order of tapeworms is now available. This genome will not only enable insights into S. erinaceieuropaei but also into other species of the group, including the important fish parasites of Diphyllobothrium spp. [16].

At 1.26 Gb, the present sequence is the largest reported for a flatworm. In particular, it is nearly 10 times larger than the genomes of the published cyclophyllid tapeworms (which range from 115 to 152 Mb) [2,3]. Some of this size difference is likely to be due to the fragmentary nature of the assembly. Assessment of read depth in mapped sequencing data suggests that the potential contribution of split alleles to the genome size is low. The S. erinaceieuropaei predicted proteome (68.4 Mb) is only somewhat larger than those of other tapeworms (50.7 Mb in E. multilocularis and 46.4 Mb in Hymenolepis microstoma) and indeed comparable to the proteome of the trematode S. mansoni (68.2 Mb); therefore, artefactual duplications in the assembly are unlikely to account for its huge genome size. Longer introns, which average 1,065 bp in comparison with 573 to 863 bp in the Cyclophyllidea species, may inflate the genome. In addition, the genome is much more repetitive than that sequenced from other tapeworms; almost half of the S. erinaceieuropaei genome size is apportioned to repetitive elements - much greater than in sequenced Cyclophyllidea species (7 to 11%) [2]. Of these elements, LINEs constitute a large percentage, in contrast to cyclophyllids, which have far fewer.

Our initial approach to interrogating the genome concentrated on the targets of current tapeworm chemotherapy, and on candidate novel targets identified from genome data. The gene for the most highly expressed β-tubulin in the larval stage of E. multilocularis (EmuJ_000672200, or tub-2) contains resistance-associated amino acids. It has been suggested that this accounts for the reduced sensitivity of the cestode larval stage to benzimidazole drugs [17]. We found an S. erinaceieuropaei orthologue, which we predict to be insensitive to albendazole based on the presence of tyrosine amino acid residues in positions that are known to confer resistance in other organisms. We reasoned that if the orthologue expression pattern is similar across species [45], then, as with E. multilocularis, benzimidazole would likely be suboptimal for chemotherapy against larval tapeworms of Spirometra. Using recently published EST data from the larva of S. erinaceieuropaei [20], we identified β-tubulin transcripts and found that the majority contained the benzimidazole resistance-associated amino acids.

Cases of sparganosis that were unresponsive to praziquantel have previously been reported [46]. Both sensitive and resistant configurations of a proposed target of praziquantel, CaV2 B, are encoded by the genome. Future studies addressing the mode of action of praziquantel and target protein amino acid dependencies, along with functional studies of tapeworms, may reveal the underlying genetic basis of reported resistance. The greater number of ATP cassette domains identified in S. erinaceieuropaei gene transcripts in comparison with E. multilocularis may indicate a greater number of functional genes, with perhaps greater diversity in the worm’s efflux capabilities and therefore its ability to process xenobiotic compounds.

As new drugs against tapeworms are introduced, shared molecular targets, some putative examples of which are summarized in our results, can continue to be assessed using genome level information on S. erinaceieuropaei. In terms of suitable drug action, in cerebral cases even drugs that prevent movement of the worm (and hence more widespread tissue disruption) could be beneficial if curative surgery is delayed or not possible because of patient health or the location of the worm. In cases that affect the central nervous system, such as in the presented case, the associated side effects of any drug treatment should also be considered. In our study we also identify proteins that are likely to be involved in host-parasite interactions, which may feed into treatment considerations or possible new diagnostic tests (for example, a serological reaction against recombinant PGF). In the present case, inflammation in the brain in response to the worm is likely to have contributed to the patient’s symptoms; determining whether or how the live worm modulates inflammation could provide vital information for choosing between drug treatment or surgery.

We also observed expansions in serine proteases and Kunitz-type protease inhibitors in S. erinaceieuropaei compared with E. multilocularis and E. granulosus, which may aid S. erinaceieuropaei in its invasion of a wide range of hosts. Interestingly, chymotrypsin A-like proteases were the most expanded serine protease class. Within nematodes, a large expansion of this class was also described in T. muris, which lives in close association with the host gut [31]. Here, therefore, we may be observing convergent utilisation of this set of proteases in two unrelated parasites.

We used the genome to examine expanded gene families in S. erinaceieuropaei. Nine out of the 25 most frequently expressed Pfam domains reported in S. erinaceieuropaei EST data [20] are also present in the top expanded gene

families that we have identified. Thus, expanded gene families (protein kinase, BTB/Kelch associated, EF hand, WD40 repeat, Kelch motif, fibronectin type III, zinc finger C2H2, AMP-dependent synthetase and dynein light chain) are also amongst the most expressed and therefore likely to be functionally important to the organism. Nine expanded families appear to be involved in transcriptional regulation. The life cycle of S. erinaceieuropaei is composed of discrete morphologically distinct multicellular forms adapted for different hosts. Therefore, a complex set of transcriptional regulators would be expected to coordinate the expression of proteins required for each stage. A further nine expanded gene families appear to be associated with metabolic processing or detoxification pathways. It is possible that a range of metabolic and detoxification adaptations allow the parasite to live in a wide range of hosts (, , amphibians and ) as well as in aquatic environments, as is the case for the free-swimming miracidia. The 20 expanded gene families with unknown function demonstrate how little we know about this order of tapeworms.

As sparganosis is a rare infection, drug re-purposing may offer the greatest hope for the patients afflicted. In terms of new potential targets for intervention, in S. erinaceieuropaei we observed the largest diversity of metalloproteases of the M17 class reported in any organism thus far. Leucyl aminopeptidases of the M17 class have been considered potential targets for antimalarial drugs [47,48] and with active drug discovery programmes underway [49] new open access drugs will be developed for malaria that could be used against more neglected parasites. Publicly available genome level information on S. erinaceieuropaei, and its continual interrogation by the medical research community, will facilitate the necessary inferences to be made concerning the cross-applicability of the latest chemotherapy treatments.

Conclusions
We have reported the first known case of sparganosis in the UK and have diagnosed the infective species to be the pseudophyllidean tapeworm S. erinaceieuropaei, using DNA isolated from a surgical biopsy. Previously, sparganosis has predominantly been reported in Asia and this case highlights how an increase in global mobility can bring new challenges to clinicians facing infections from outside their usual geographical range. By describing the clinical presentation, in which a multiloculate lesion was seen migrating across the brain, we hope that this rare but debilitating infection will be on the radar as a diagnostic possibility for future cases.

Given the paucity of molecular data for this human pathogen, we used the small quantity of DNA present in a biopsy sample to generate a genome de novo. The genome represents the first draft genome from the order Diphyllobothriidea. Aware of the fragmented nature of the assembly, we have conservatively analysed its gene content, in the context of comparisons with other flatworms, and found a diverse set of gene expansions that are not present in other tapeworms previously sequenced. These include genes that may be key to the organism’s success in multiple divergent hosts and tissue types.

From the genome data we have evaluated potential druggability and our results suggest that albendazole is unlikely to be effective but that many drugs previously proposed as candidates for repurposing against more common tapeworms are likely to also be effective against S. erinaceieuropaei. The availability of the genome data will provide an ongoing reference for similar molecular comparisons.

Materials and methods
Ethics statement
The patient has given written consent allowing for publication of this case and associated images. To remove any patient data from our reference genome, sequencing reads were screened against the human 1000 genome reference assembly, NCBI36, [50] using the Burrows-Wheeler Aligner software package (aln and sampe command) with default settings [51]. The forward and reverse reads were aligned independently and any matches were removed, along with the paired read, to a separate file with permissions that deny access.

Pathology/histology methods
The neurosurgical specimen was formalin-fixed and processed to paraffin for sectioning (5 micron thickness). Haematoxylin and eosin (H&E), PAS, Grocott methenamine silver, Ziehl-Neelsen and modified Ziehl-Neelsen stains were applied. Inflammatory infiltrates were immunocytochemically stained with commercially available antibodies to CD3 (NovoCastra, Newcastle upon Tyne, Tyne and Wear, UK), CD79a (Dako, Glostrup, Hovedstaden, Denmark) and CD68 (Dako) for T cells, B cells and microglia and macrophages, respectively. For images a Leica DMLB microscope with Leica DFC320 digital camera was used in conjunction with Leica IM50 Image Manager Version 4.0 software (Leica Microsystems Imaging Solutions Ltd, Cambridge, UK).

DNA extraction
A slide-mounted unstained section of worm was manually detached from substrate using an adjacent stained sample as a guide. The worm sample was then deparaffinized and the DNA extracted using the QIAamp DNA FFPE Tissue Kit (Qiagen, Venlo, Limburg, Netherlands). DNA was measured using Qubit® fluorometric quantification (97 ng total).
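The human-read screening described in the Ethics statement (mates aligned independently, and a pair removed if either mate matches the human reference) can be sketched in Python as set logic over read-pair identifiers. This is only an illustration under stated assumptions: the function name and the pair IDs are hypothetical, and the sets of matching IDs are assumed to have been collected beforehand from the aligner output, which is not parsed here.

```python
from typing import Iterable, List, Set, Tuple


def screen_read_pairs(
    pair_ids: Iterable[str],
    forward_hits: Set[str],
    reverse_hits: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split read pairs into (kept, removed).

    forward_hits / reverse_hits hold the IDs of pairs whose forward or
    reverse mate aligned to the human reference. A hit on either mate
    removes the whole pair, mirroring "along with the paired read".
    """
    human = set(forward_hits) | set(reverse_hits)
    kept: List[str] = []
    removed: List[str] = []
    for pair_id in pair_ids:
        if pair_id in human:
            removed.append(pair_id)  # destined for the access-restricted file
        else:
            kept.append(pair_id)     # retained for assembly
    return kept, removed


# Hypothetical pair IDs for illustration only.
kept, removed = screen_read_pairs(
    ["pair1", "pair2", "pair3", "pair4"],
    forward_hits={"pair2"},
    reverse_hits={"pair4"},
)
```

Here `pair2` and `pair4` are removed even though only one mate of each matched, which is the conservative behaviour the text describes.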

Molecular diagnosis
PCR was carried out using primers for the mitochondrial cytochrome oxidase c subunit 1 (cox1) as used by Liu et al. [14]: JB3 5′-TTTTTTGGGCATCCTGAGGTTTAT-3′, JB4 5′-TAAAGAAAGAACATAATGAAAATG-3′. PCR was also carried out using primers for nad1 (Senad1F 5′-ATAAGGTGGGGGTGATGGGGTTG-3′, Senad1R 5′-ATAAAAAATAAAAGATGAAAGGG-3′) and cox3 (Secox3F 5′-GGGTGTCATTTCTTCCTATTTTTAA-3′, Secox3R 5′-AAATGTCAATACCAAGTAACTAAAG-3′), as described in Liu et al. [52]. PCRs (50 μl) were performed in 1× KAPA HiFi HotStart ReadyMix (Kapa Biosystems, Wilmington, MA, USA) with 50 pmol of each primer and 1 μl sample (0.485 ng/μl). Reaction conditions were an initial denaturation at 98°C for 5 minutes, followed by 35 cycles of 98°C for 20 s, 55°C for 15 s, 72°C for 30 s, then a final extension step of 72°C for 5 minutes. After gel electrophoresis, bands were cut out from the agarose and extracted using the QIAquick® Gel Extraction Kit (Qiagen). The DNA was capillary sequenced at the Wellcome Trust Sanger Institute using SP6 and T7 sequencing primers. A high quality consensus sequence from both reads was used for analysis.

Paired-end Illumina sequencing
DNA (48.5 ng) was used for the preparation of a paired-end Illumina library. Briefly, DNA was fragmented to 400 to 550 bp using Adaptive Focused Acoustics technology with the E210 instrument (Covaris, Woburn, MA, USA) (duty cycle 20; intensity 5; cycles/bursts 200; seconds 30; temperature 4°C). After the DNA was fragmented it was cleaned and concentrated with a 1:1 ratio of Ampure XP magnetic beads. This was repeated after subsequent end repair and dA-tailing reactions with the respective modules supplied by New England Biolabs (Ipswich, MA, USA) (NEBNext™ DNA Sample Prep Reagent Set 1: E6000), following the manufacturer’s instructions. To ligate sequencing adaptors, a 50 μl reaction mixture containing the sample was set with addition of 25 μl of 2× DNA T4 ligase buffer (New England Biolabs, Inc.), 4 μl 4 μM Illumina paired-end duplex adaptors (Integrated DNA Technologies, Coralville, IA, USA) and 2 μl T4 DNA ligase. The ligation reaction was incubated at 20°C for 30 minutes before a 1:1 ratio round of clean up, with Ampure XP magnetic beads. This was then repeated with a 0.7:1 ratio of beads to sample to remove adaptor dimers. Eight cycles of PCR were carried out on the sample using 1× KAPA HiFi HotStart ReadyMix (Kapa Biosystems) with paired-end primers 1.0 and 2.0 (Illumina). The resulting library was loaded for a paired-end sequencing run on the Illumina HiSeq 2000 system with 100 cycles. This generated 54,723,550,600 bp of data, representing approximately 43× coverage.

De novo genome assembly
Short paired-end sequence reads were first corrected and initially assembled using SGA v0.9.7 [53]. The distribution of k-mers for all odd values of k between 41 and 81 was calculated using GenomeTools v.1.3.7 [54]. A k-mer length of 75, selected as the length that produced the maximum number of unique k-mers, was used for de Bruijn graph construction in a subsequent assembly with Velvet v1.2.03 [55]. Approximately 1,103 CPU hours were used for assembly, with a peak memory usage of 116 GB.

Genome assembly quality assessment
When mapped back to the assembly with SMALT, raw sequencing data from each lane (lane 8823_7 and lane 9489_2) gave a peak insert size of 400 to 450 bp (Additional file 8) and a low duplicate rate of 8.3% and 8.8%, respectively. The percentage of reads containing low quality sequence or adaptor sequence was negligible as assessed using Trimmomatic [56] (3.32%). REAPR detects possible misassembly sites using paired-end reads and then breaks the assembly to give the most conservative but accurate representation of the assembly [57]. We found that after using REAPR the N50 only decreased by approximately 100 bp from 4.6 to 4.5 kb, with 12,687 extra scaffolds, whilst the largest scaffold remained the same. To investigate the potential for collapsed regions or split alleles in the genome, we examined coverage of a subset of SMALT mapped data (lane 8823_7) across 5-kb binned regions in scaffolds that were 6 kb or longer. The mean coverage was 16.9 with a median of 15.4 (interquartile range 6.72). We found that 7% of the genome was below 0.6× median coverage, and 8% was above 1.6× median coverage. For the mitochondrial genome, we found that 137 contigs in a BLAST search against the mitochondrial sequence of a Chinese isolate [41] gave a significant match with an E value of <1e-50.

Gene predictions
Gene prediction for S. erinaceieuropaei was conducted by various methods available in MAKER version 2.2.28 [15]. The MAKER annotation pipeline consists of four general steps to generate high-quality annotations by taking into account evidence from multiple sources. First, assembled contigs are filtered against RepeatRunner [58] and a species specific repeat library (generated by RepeatModeler [18]) using RepeatMasker [19] to identify and mask repetitive elements in the genome. Second, gene predictors Augustus 2.5.5 [59], GeneMark-ES 2.3a (self-trained) [60] and SNAP 2013-02-16 [61] are employed to generate ab initio gene predictions that can use evidence within MAKER. Further species-specific gene models were provided to MAKER using comparative algorithms against the S. erinaceieuropaei genome: genBlastG [62] output of C. elegans gene models from Wormbase [63] and RATT [64] output of H.

microstoma gene models [2]. These models cannot be influenced by MAKER evidence as they were provided by gff file. Next, species-specific cDNAs available from the International Nucleotide Sequence Database Collaboration [65] and proteins from related organisms were aligned against the genome using BLASTN and BLASTX [66], and these alignments were further refined with respect to splice sites using Exonerate [67]. Finally, the protein homology alignments, comparative gene models and ab initio gene predictions are integrated and filtered by MAKER and project specific scripts to produce a set of evidence-informed gene annotations.

The MAKER genome annotation pipeline was run three consecutive times. In the absence of a species-specific trained gene predictor, Augustus and SNAP were trained using CEGMA [68] protein evidence gained from the default KOGs and hidden Markov model profiles of Cestode orthologous groups (CEOGs; unpublished by MM and JM). The first run of MAKER was performed using the est2genome and protein2genome option with the handful of -specific cDNAs, and platyhelminth protein sequences, respectively. Gene models obtained from the first run were used to retrain SNAP and models from the second run were used to retrain Augustus. With the trained models, MAKER was run a third time using a taxonomically broader protein set that included metazoan proteins from the UniProt Complete protein database [37] and a subset of helminth proteomes from GeneDB [69].

Comparative analysis
The InterProScan 5 tool was used to provide domain-level predictions on predicted gene transcripts [70]. Protease and protease inhibitors were characterized using the specialist database MEROPS [34]. InterPro domains with the keywords protease, proteinase, proteolytic or peptidase were used to obtain the geneIDs and subsequently the transcript FASTA files for candidates. Candidate transcript sequences were submitted as a batch BLAST to MEROPS, which provided a report on protease family hits.

EnsemblCompara GeneTrees (v75) is a fault-tolerant pipeline to run orthology and paralogy gene prediction analysis using TreeFam methodology to provide a complete set of phylogenetic trees [71]. The Cestoda species included in the comparison with S. erinaceieuropaei were E. multilocularis, E. granulosus, T. solium and H. microstoma. The trematode species S. mansoni and C. sinensis were also included in the comparison. Outgroups included were C. teleta and C. gigas. International Nucleotide Sequence Database Collaboration (INSDC) genome assemblies and project IDs for ComparaEnsembl comparative analysis were as follows: C. teleta, Capca1 (PRJNA175705); C. gigas, oyster_v9 (PRJNA70283); T. solium, TSMEXv1 (PRJNA170813); E. granulosus, EGRAN001 (PRJEB121); E. multilocularis, EMULTI001 (PRJEB122); H. microstoma, HMIC001 (PRJEB124); S. mansoni, ASM23792v2 (PRJEA36577); C. sinensis, C_sinensis-2.0 (PRJDA72781). For each species considered in the analysis, the longest protein translation for each gene is identified. Each protein is queried using NCBI-BLAST against each individual protein within (self-species) and between all species [72]. From these results graphs are constructed. Connections (edges) between the nodes (proteins) are retained when they satisfy either a best reciprocal hit (BRH) or a BLAST score ratio (BSR) over 0.33. From the graph, the connected components (that is, single linkage clusters) are extracted. Each connected component represents a cluster, that is, a gene family. If the cluster has greater than 750 members, the graph construction and clustering steps are repeated at higher stringency. Proteins in the same cluster are aligned using MUSCLE to obtain a multiple alignment [73]. The coding sequence back-translated protein-based multiple alignment is used as an input to the tree program, TreeBeST, as well as a multifurcated species tree which is necessary for reconciliation and the duplication calls on internal nodes [74]. The resulting trees are flattened into ortholog and paralog tables of pairwise relationships between genes. In the case of paralogs, this flattening also records the timing of the duplication due to the presence of extant species past the duplication, and thus implicitly outgroup lineages before the duplication. This method produces trees with less anomalous topologies than single protein-based phylogenetic methods.

Data availability
Sequences for cox3 and nad1 amplicons from the clinical sample have been deposited in GenBank under accession IDs KM031786 and KM031787, respectively. The S. erinaceieuropaei genome, predicted transcripts, protein and annotation (*.GFF) files are available from the Wormbase resource [63] under BioProject PRJEB1202 (S_erinaceieuopaei_v1_0_4) [75].

Accession numbers LN000001 to LN482396 in the European Nucleotide Archive (ENA) cover the S. erinaceieuropaei genome assembly. The raw data (Illumina reads) are available from ENA via accession number ERS182798. ComparaEnsembl GeneTree IDs and tree in Newick format are available in Additional file 3.

Parasite genome assemblies used in the ComparaEnsembl GeneTree analysis are available through the Wormbase resource with the following BioProject IDs and version names: E. multilocularis, PRJEB122 (EMULTI001); E. granulosus, PRJEB121 (EGRAN001); H. microstoma, PRJEB124 (HMIC001); S. mansoni, PRJEA36577 (ASM23792v2); C. sinensis, PRJDA72781 (C_sinensis-2.0). Outgroup genomes are available from INSDC: C. teleta, PRJNA175705 (Capca1); C. gigas, PRJNA70283 (oyster_v9).

Additional files

Additional file 1: Axial MRI images from patient before surgery (2012). T1 weighted scan before (C) and after (D) the injection of gadolinium. Gadolinium multiple ring enhancing lesions are present in the left thalamic region and adjacent posterior limb of the internal capsule (D). T2 weighted scan shows an area of brain oedema (A) surrounding the lesions. Diffusion weighted imaging showed no restriction in diffusion (B), indicating that the lesion was not recently ischaemic.

Additional file 2: Cox1 in genome data. The cox1 amplicon aligned with the best genome hit from BLAST; asterisks indicate consensus sequence and numbers on the right-hand side indicate position within each sequence. Base differences from previously reported S. erinaceieuropaei cox1 sequence are confirmed in the genome. The genome matches previously reported S. erinaceieuropaei cox1 sequence at two sites at the tail of the amplicon sequencing where bases differ in the amplicon.

Additional file 3: ComparaEnsembl GeneTree IDs and trees. GeneTree IDs and corresponding trees generated from the ComparaEnsembl clustering pipeline.

Additional file 4: Table of gene expansions by number of genes in family. E. multilocularis was included because it has the most complete reference genome of the tapeworms. This table shows a list of the most obvious expanded gene families, exhibiting at least 10 times as many genes as in E. multilocularis. This criterion gave 39 expanded families in S. erinaceieuropaei. Annotation and description of biological function were provided by examining the annotation for the E. multilocularis and S. mansoni members of each GeneTree.

Additional file 5: Distribution of median protein length of predicted Cestoda gene families. Histogram showing the distribution of the median protein length for each species, within each family predicted by the EnsemblCompara GeneTree pipeline. Median protein length has a more compact distribution in S. erinaceieuropaei than the other species, and drops comparatively sharply after 300 amino acids.

Additional file 6: Scatterplot of the E. multilocularis median protein length in each EnsemblCompara GeneTree family against the number of S. erinaceieuropaei proteins predicted in the family. The function of the line drawn is Y = 0.002324 × X + 0.686443, with a P-value of <2e-16 and an adjusted R² value of 0.1648. A large amount of biological variation is likely to explain the lack of points contributing to this linear model. Note, however, that the linear model does extrapolate to titin, the largest known natural protein, which artificially appears expanded in S. erinaceieuropaei. The total predicted protein length of the titin family in E. multilocularis is 11,194 amino acids, which is greater than the 9,086 amino acids in S. erinaceieuropaei.

Additional file 7: Gene families obtained from the EnsemblCompara GeneTrees pipeline ranked by the ratio of the cumulative length of proteins in the gene family of S. erinaceieuropaei to E. multilocularis. The criterion for expansion was set at 3 and above, resulting in 83 gene families for further investigation. The table contains the consensus indicative annotation for E. multilocularis and S. mansoni members of the gene family, a brief description of the associated biological function (as ascertained by the supported text in UniProt, NCBI and Pfam) and an assignment to one of ten top-level categories. These categories are: structural/cellular transport (❖), regulation of transcription (♦), post-translation modification or processing (★), transporter (Δ), receptor/signal transduction (↲), protease (◉), mRNA processing (✚), metabolic processing/detoxification (✪), cell cycle or DNA repair (○) and unknown (■).

Additional file 8: Insert size of two lanes of paired-end Illumina HiSeq 2000 data. Raw reads for each sequencing lane were mapped back to the genome assembly using SMALT to determine the insert size distribution. Both data sets were within the required range.

resonance imaging; PCR: polymerase chain reaction; PGF: plerocercoid growth factor.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
HMB: genome analysis and data interpretation, drafting the paper. IJT, ES, AT, AC, NH, BH and DR: genome assembly, CEGMA, gene prediction, ComparaEnsembl pipeline and database queries. SS: repeat analysis. HPM, EGK and AJC: diagnosis and treatment. KSJA and AFD: histopathology. PLC and SBL: morphological diagnosis and serology. NMA: radiology. SJP and TSS: neurosurgery and neurology. AFD, EGK, HPM, HMB, MB: study design and concept. All authors read and approved the final manuscript.

Acknowledgements
This work was funded by the Wellcome Trust through their support of the Wellcome Trust Sanger Institute (grant 098051). PLC is supported by the National Institute for Health Research University College London Hospitals Biomedical Research Centre. SP is funded by a Clinician Scientist Award from the National Institute of Health Research. For serology we also acknowledge the Serology Section, Public Health England National Parasitology Reference Laboratory, Hospital for Tropical Diseases, London. We thank Makedonka Mitreva and John Martin for providing CEOGs.

Author details
1 Wellcome Trust Sanger Institute, Parasite Genomics, Cambridge CB10 1SA, UK. 2 Department of Histopathology Section, Addenbrooke’s NHS Trust, Cambridge CB2 0QQ, UK. 3 Department of Infectious Diseases, Addenbrooke’s NHS Trust, Cambridge CB2 0QQ, UK. 4 Department of Radiology, Addenbrooke’s NHS Trust, Cambridge CB2 0QQ, UK. 5 Department of Neurosurgery, Addenbrooke’s NHS Trust, Cambridge CB2 0QQ, UK. 6 Hospital for Tropical Diseases and London School of Hygiene and Tropical Medicine, London WC1E 6JD, UK. 7 Department of Histopathology, St Thomas’s Hospital, London SE1, UK. 8 Biodiversity Research Center, Academia Sinica, Taipei 11529, Taiwan. 9 Eagle Genomics, Babraham Research Campus, Babraham, Cambridge CB22 3AT, UK.

Received: 19 September 2014 Accepted: 28 October 2014

References
1. World Health Organization: Neglected Tropical Diseases. [http://www.who.int/neglected_diseases/diseases/en/]
2. Tsai IJ, Zarowiecki M, Holroyd N, Garciarrubio A, Sanchez-Flores A, Brooks KL, Tracey A, Bobes RJ, Fragoso G, Sciutto E, Aslett M, Beasley H, Bennett HM, Cai J, Camicia F, Clark R, Cucher M, De Silva N, Day TA, Deplazes P, Estrada K, Fernandez C, Holland PWH, Hou J, Hu S, Huckvale T, Hung SS, Kamenetzky L, Keane JA, Kiss F, et al: The genomes of four tapeworm species reveal adaptations to parasitism. Nature 2013, 496:57–63.
3. Zheng H, Zhang W, Zhang L, Zhang Z, Li J, Lu G, Zhu Y, Wang Y, Huang Y, Liu J, Kang H, Chen J, Wang L, Chen A, Yu S, Gao Z, Jin L, Gu W, Wang Z, Zhao L, Shi B, Wen H, Lin R, Jones MK, Brejova B, Vinar T, Zhao G, McManus DP, Chen Z, Zhou Y, et al: The genome of the hydatid tapeworm Echinococcus granulosus. Nat Genet 2013, 45:1168–1175.
4. Tappe D, Berger L, Haeupler A, Muntau B, Racz P, Harder Y, Specht K, Prazeres da Costa C, Poppert S: Case report: molecular diagnosis of subcutaneous Spirometra erinaceieuropaei sparganosis in a Japanese immigrant. Am J Trop Med Hyg 2013, 88:198–202.
5. Schauer F, Poppert S, Mockenhaupt M, Muntau B, Jakob T, Jakob T: Travel-acquired subcutaneous Sparganum proliferum infection diagnosed by molecular methods. Br J Dermatol 2014, 170:741–743.
6. Li M-W, Song H-Q, Li C, Lin H-Y, Xie W-T, Lin R-Q, Zhu X-Q: Sparganosis in mainland China. Int J Infect Dis 2011, 15:e154–e156.
7. Pampiglione S, Fioravanti ML, Rivasi F: Human sparganosis in Italy. Case report and review of the European cases. APMIS 2003, 111:349–354.
8. Ho T-H, Lin M-C, Yu W-W, Lai P-H, Sheu S-J, Bee Y-S: Ocular sparganosis Abbreviations mimicking an orbital idiopathic inflammatory syndrome. Orbit 2013, bp: base pair; CEGMA: Core Eukaryotic Genes Mapping Approach; 32:395–398. EST: expressed sequence tag; INSDC: International Nucleotide Sequence 9. Lee K-J, Myung N-H, Park H-W: A case of sparganosis in the leg. Korean J Database Collaboration; LINE: long interspersed element; MRI: magnetic Parasitol 2010, 48:309–312. Bennett et al. Genome Biology 2014, 15:510 Page 16 of 17 http://genomebiology.com/2014/15/11/510

10. Lee S-H, We J-S, Sohn W-M, Hong S-T, Chai J-Y: Experimental life history of 36. Becker DJ, Lowe JB: Fucose: biosynthesis and biological function in Spirometra erinacei. Korean J Parasitol 1990, 28:161–173. mammals. Glycobiology 2003, 13:41R–53R. 11. Encyclopedia of Life: Zaocys dhumnades Cope, 1860. [http://eol.taibif.tw/ 37. Faruque A, Alpi E, Antunes R, Arganiska J, Casanova EB, Bely B: Activities at pages/72282#3] the Universal Protein Resource (UniProt). Nucleic Acids Res 2014, 42:7486. 12. NOAA Photo Library: Zooplankton. Copepod with eggs. [http://www. 38. Gong C, Liao W, Chineah A, Wang X, Hou BL: Cerebral sparganosis in photolib.noaa.gov/htmls/fish3260.htm] children: epidemiological, clinical and MR imaging characteristics. 13. Wikipedia: Welsh sheepdog. [http://en.wikipedia.org/wiki/Image: BMC Pediatr 2012, 12:155. Welsh_Sheepdog.jpg] 39. Miyadera H, Kokaze A, Kuramochi T, Kita K, Machinami R, Noya O, Alarcón 14. Liu W, Zhao GH, Tan MY, Zeng DL, Wang KZ, Yuan ZG, Lin RQ, Zhu XQ, Liu Y: de Noya B, Okamoto M, Kojima S: Phylogenetic identification of Survey of spirometra erinaceieuropaei spargana infection in the frog Rana Sparganum proliferum as a pseudophyllidean cestode by the sequence nigromaculata of the Hunan Province of China. Vet Parasitol 2010, 173:152–156. analyses on mitochondrial COI and nuclear sdhB genes. Parasitol Int 2001, 15. Holt C, Yandell M: MAKER2: an annotation pipeline and genome-database 50:93–104. management tool for second-generation genome projects. BMC Bioinformatics 40. Harvey K, Esposito DH, Han P, Kozarsky P, Freedman DO, Plier DA, Sotir MJ: 2011, 12:491. Surveillance for travel-related disease–GeoSentinel Surveillance System, 16. Scholz T, Garcia HH, Kuchta R, Wicht B: Update on the human broad United States, 1997–2011. MMWR Surveill Summ 2013, 62:1–23. tapeworm (genus diphyllobothrium), including clinical relevance. 41. 
Liu G-H, Li C, Li J-Y, Zhou D-H, Xiong R-C, Lin R-Q, Zou F-C, Zhu X-Q: Clin Microbiol Rev 2009, 22:146–160. Characterization of the complete mitochondrial genome sequence of 17. Olson PD, Zarowiecki M, Kiss F, Brehm K: Cestode genomics - progress and Spirometra erinaceieuropaei (Cestoda: ) from China. prospects for advancing basic and applied aspects of flatworm biology. Int J Biol Sci 2012, 8:640–649. Parasite Immunol 2012, 34:130–150. 42. Li A-H, Na B-K, Ahn S-K, Cho S-H, Pak J-H, Park Y-K, Kim T-S: Functional 18. RepeatModeler. [http://www.repeatmasker.org/RepeatModeler.html] expression and characterization of a cytosolic copper/zinc-superoxide 19. RepeatMasker. [http://www.repeatmasker.org/] dismutase of Spirometra erinacei. Parasitol Res 2010, 106:627–635. 20. Kim D-W, Yoo WG, Lee M-R, Yang H-W, Kim Y-J, Cho S-H, Lee W-J, Ju J-W: 43. Lee S-U, Huh S: Cloning of the novel putative apoptosis-related gene of Transcriptome sequencing and analysis of the zoonotic parasite Spirometra erinacei (Order ). Korean J Parasitol 2006, Spirometra erinacei spargana (plerocercoids). Parasit Vectors 2014, 7:368. 44:233. 21. Kwa MSG, Veenstra JG, Van DM, Roos MH: b-tubulin genes from the 44. Kim D-W, Kim D-W, Yoo WG, Nam S-H, Lee M-R, Yang H-W, Park J, Lee K, parasitic haemonchus contortus modulate drug resistance in Lee S, Cho S-H, Lee W-J, Park H-S, Ju J-W: SpiroESTdb: a transcriptome Caenorhabditis elegans. J Mol Biol 1995, 246:500–510. database and online tool for sparganum expressed sequences tags. 22. Prichard RK: Genetic variability following selection of Haemonchus BMC Res Notes 2012, 5:130. contortus with anthelmintics. Trends Parasitol 2001, 17:445–453. 45. McCarroll SA, Murphy CT, Zou S, Pletcher SD, Chin C-S, Jan YN, Kenyon C, 23. Sinha S, Sharma BS: Neurocysticercosis: a review of current status and Bargmann CI, Li H: Comparing genomic expression patterns across management. J Clin Neurosci 2009, 16:867–876. species identifies shared transcriptional profile in aging. 
Nat Genet 2004, 24. Greenberg RM: Are Ca2+ channels targets of praziquantel action? Int J 36:197–204. Parasitol 2005, 35:1–9. 46. Kim DG, Paek SH, Chang KH, Wang KC, Jung HW, Kim HJ, Chi JG, Choi KS, 25. Kohn AB, Roberts-Misterly JM, Anderson PA, Greenberg RM: Creation by Han DH: Cerebral sparganosis: clinical manifestations, treatment, and mutagenesis of a mammalian Ca2+ channel β subunit that confers outcome. J Neurosurg 1996, 85:1066–1071. praziquantel sensitivity to a mammalian Ca2+ channel. Int J Parasitol 47. Lew VL, Macdonald L, Ginsburg H, Krugliak M, Tiffert T: Excess 2003, 33:1303–1308. haemoglobin digestion by malaria parasites: a strategy to prevent 26. Greenberg RM: ABC multidrug transporters in schistosomes and other premature host cell lysis. Blood Cells Mol Dis 2004, 32:353–359. parasitic flatworms. Parasitol Int 2013, 62:647–653. 48. Stack CM, Lowther J, Cunningham E, Donnelly S, Gardiner DL, Trenholme 27. Phares CK: The growth hormone-like factor from plerocercoids of the KR, Skinner-Adams TS, Teuscher F, Grembecka J, Mucha A, Kafarski P, Lua L, tapeworm Spirometra mansonoides is a multifunctional protein. In Effects Bell A, Dalton JP: Characterization of the Plasmodium falciparum M17 On Host Hormones and Behavior, Parasites and Pathogens. Edited by Beckage leucyl aminopeptidase. A protease involved in amino acid regulation NE. 1997:99–112. with potential for antimalarial drug development. J Biol Chem 2007, 28. Liu LN, Cui J, Zhang X, Wei T, Jiang P, Wang ZQ: Analysis of structures, 282:2069–2080. functions, and epitopes of cysteine protease from Spirometra 49. Medicines for Malaria Venture. [http://www.mmv.org/] erinaceieuropaei Spargana. Biomed Res Int 2013, 2013:198250. 50. 1000 Genomes. [http://www.1000genomes.org/category/reference] 29. Phares CK: An unusual host-parasite relationship: the growth hormone-like 51. Li H, Durbin R: Fast and accurate long-read alignment with Burrows- factor from plerocercoids of spirometrid tapeworms. 
Int J Parasitol 1996, Wheeler transform. Bioinformatics 2010, 26:589–595. 26:575–588. 52. Liu W, Liu GH, Li F, He DS, Wang T, Sheng XF, Zeng DL, Yang FF, Liu Y: 30. Kong Y, Chung YB, Cho SY, Kang SY: Cleavage of immunoglobulin G by Sequence variability in three mitochondrial DNA regions of Spirometra excretory-secretory cathepsin S-like protease of Spirometra mansoni erinaceieuropaei spargana of human and health significance. plerocercoid. Parasitology 1994, 109:611–621. J Helminthol 2012, 86:271–275. 31. Foth BJ, Tsai IJ, Reid AJ, Bancroft AJ, Nichol S, Tracey A, Holroyd N, Cotton 53. Simpson JT, Durbin R: Efficient de novo assembly of large genomes using JA, Stanley EJ, Zarowiecki M, Liu JZ, Huckvale T, Cooper PJ, Grencis RK, compressed data structures. Genome Res 2012, 22:549–556. Berriman M: Whipworm genome and dual-species transcriptome analyses 54. Gremme G, Steinbiss SKS: GenomeTools: a comprehensive software provide molecular insights into an intimate host-parasite interaction. library for efficient processing of structured genome annotations. Nat Genet 2014, 2013:1–10. IEEE/ACM Trans Comput Biol Bioinform 2013, 10:645–656. 32. Molehin AJ, Gobert GN, McManus D: Serine protease inhibitors of parasitic 55. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly helminths. Parasitology 2012, 139:681–695. using de Bruijn graphs. Genome Res 2008, 18:821–829. 33. Horn M, Fajtová P, Rojo Arreola L, Ulrychová L, Bartošová-Sojková P, 56. Bolger AM, Lohse M, Usadel B, Planck M, Plant M, Mühlenberg A: Franta Z, Protasio AV, Opavský D, Vondrášek J, McKerrow JH, Mareš M, Trimmomatic: a flexible trimmer for Illumina sequence data. Caffrey CR, Dvořák J: Trypsin- and Chymotrypsin-like serine proteases Bioinformatics 2014, 30:2114–2120. in –‘the undiscovered country’. PLoS Negl Trop 57. Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD: REAPR: a Dis 2014, 8:e2766. universal tool for genome assembly evaluation. Genome Biol 2013, 14:R47. 34. 
Rawlings ND, Waller M, Barrett AJ, Bateman A: MEROPS: the database of 58. Smith CD, Edgar RC, Yandell MD, Smith DR, Celniker SE, Myers EW, Karpen proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res GH: Improved repeat identification and masking in Dipterans. Gene 2007, 2014, 42:D503–D509. 389:1–9. 35. Dorus S, Wilkin EC, Karr TL: Expansion and functional diversification of a 59. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B: AUGUSTUS: leucyl aminopeptidase family that encodes the major protein ab initio prediction of alternative transcripts. Nucleic Acids Res 2006, constituents of Drosophila sperm. BMC Genomics 2011, 12:177. 34:W435–W439. Bennett et al. Genome Biology 2014, 15:510 Page 17 of 17 http://genomebiology.com/2014/15/11/510

60. Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M: Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008, 18:1979–1990. 61. Korf I: Gene finding in novel genomes. BMC Bioinformatics 2004, 5:59. 62. She R, Chu JS-C, Uyar B, Wang J, Wang K, Chen N: genBlastG: using BLAST searches to build homologous gene models. Bioinformatics 2011, 27:2141–2143. 63. Harris TW, Baran J, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, Done J, Grove C, Howe K, Kishore R, Lee R, Li Y, Muller H-M, Nakamura C, Ozersky P, Paulini M, Raciti D, Schindelman G, Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Wong JD, Yook K, Schedl T, Hodgkin J, Berriman M, Kersey P, et al: WormBase 2014: new views of curated biology. Nucleic Acids Res 2014, 42:D789–D793. 64. Otto TD, Dillon GP, Degrave WS, Berriman M: RATT: Rapid Annotation Transfer Tool. Nucleic Acids Res 2011, 39:e57. 65. Nakamura Y, Cochrane G, Karsch-Mizrachi I: The international nucleotide sequence database collaboration. Nucleic Acids Res 2013, 41:D21–D24. 66. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389–3402. 67. Slater GSC, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31. 68. Parra G, Bradnam K, Korf I: CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007, 23:1061–1067. 69. Logan-Klumpler FJ, De Silva N, Boehme U, Rogers MB, Velarde G, McQuillan JA, Carver T, Aslett M, Olsen C, Subramanian S, Phan I, Farris C, Mitra S, Ramasamy G, Wang H, Tivey A, Jackson A, Houston R, Parkhill J, Holden M, Harb OS, Brunk BP, Myler PJ, Roos D, Carrington M, Smith DF, Hertz-Fowler C, Berriman M: GeneDB–an annotation database for pathogen. Nucleic Acids Res 2012, 40:D98–D108. 70. 
Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong S-Y, Lopez R, Hunter S: InterProScan 5: genome- scale protein function classification. Bioinformatics 2014, 30:1236–1240. 71. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 2009, 19:327–335. 72. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL: NCBI BLAST: a better web interface. Nucleic Acids Res 2008, 36:W5–W9. 73. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113. 74. TreeBeST. [http://treesoft.sourceforge.net/treebest.shtml] 75. WormBase ParaSite: Spirometra erinaceieuropaei. [http://parasite.wormbase. org/Spirometra_erinaceieuropaei_prjeb1202/Info/Index]

doi:10.1186/s13059-014-0510-3
Cite this article as: Bennett et al.: The genome of the sparganosis tapeworm Spirometra erinaceieuropaei isolated from the biopsy of a migrating brain lesion. Genome Biology 2014, 15:510.
