
Comparative genomics of the major parasitic worms

International Helminth Genomes Consortium

Supplementary Information

Introduction
Contributions from Consortium members
Methods
1 Sample collection and preparation
2.1 Data production, Wellcome Trust Sanger Institute (WTSI)
  DNA template preparation and sequencing
  Genome assembly
  Assembly QC
  Gene prediction
  Contamination screening
2.2 Data production, McDonnell Genome Institute (MGI)
  Genome sequencing library preparation
  Genome assembly
  Assembly QC / Contamination screening
  Transcriptome sequencing and assembly
  Gene prediction
2.3 Data production, Blaxter and Neglected Genomics (BaNG)
  Genome sequencing library preparation and sequencing
  Genome assembly
  Assembly QC
  Gene prediction
3 Functional annotation
  Assigning names to predicted proteins
  Assigning GO terms to predicted proteins
4 Repeat libraries and repeat-masking
5 Regression model for genome size
6 Mitochondrial genome analysis
7 Defining high-quality ‘tier 1’ for downstream analyses
8 Compara database of gene families
  Construction of the in-house Compara database
  Identification of gene families, orthologs and paralogs
9 Identification of synapomorphic gene families
10 Phylogenetic analysis of candidate lateral gene transfers
11 Network representation of gene families
12 based on gene family presence/absence
13 Identification of gene family expansions
14 Species Tree
15 Novel domain combinations
16 Ion Channels and ABC Transporters
17
18 Kinase prediction
20 Signal peptide for secretion and TM domains predictions
21 InterPro and GO annotations
22 Species-level functional enrichment (GO / InterPro / ) analysis
23 SCP/TAPS
24 GPCR analysis
25 Metabolism
  Assigning ECs to predicted proteins and generating high-confidence EC predictions
  Reconstructing metabolic pathways and pathway -filling
  Analysis of KEGG metabolic modules and pathways
  Analysis of chokepoints in metabolic pathways
  Carbohydrate active enzymes (CAZymes)
26 Identification of Potential Drug Targets and Drugs
  Known anthelmintic drugs and compounds
  Dendrogram of known anthelmintic compounds
  Identifying potential helminth drug targets
  Identifying potential new anthelmintic drugs in ChEMBL
  Diversity analysis for creating a ‘diverse screening set’
  Identifying compounds available for purchase using ZINC15
  Self-organising map of compounds
Supplementary Results
1. Genomic diversity in parasitic nematodes and platyhelminths
1.1 Genome sequencing and assembly
  Sequencing strategy
  Genome assembly pipeline validation
  Assembly statistics
1.2 Gene-finding and annotation
  Gene-finding
  Gene count
  Assessing gene-finding accuracy using curated gene sets
1.3 Variation in genome size
  Evidence for recent genome duplication events
  Effect of repeat content on genome size
  Effect of coding DNA and intron content on genome size
  Contribution of repeats, , introns and intergenic DNA to genome size variation
1.4 Mitochondrial Genome Evolution
2. Novel gene families and gene family evolution
2.1 Compara database of gene families
  Overview of the Compara database
  Phylogenetic pattern of gene family presence and absence
2.2 Synapomorphies
2.3 Gene family expansions
  Gene families expanded in parasites
  Expansions of gene families involved in -parasite interactions
  Expansions of helminth immunity- and development-related families
  Inexplicable expansions
2.4 Hypothetical genes and families
2.5 GO Annotation Enrichment
2.6 Species Tree
3. Proteins historically targeted for drug development
3.1 SCP/TAPS protein family
3.2 Proteases and Inhibitors
  proteases
  proteases, prolyl oligopeptidases, and
  Protease inhibitors
  Trypsin, trypsin-like, and chymotrypsin/elastase protease inhibitors
  Other protease inhibitors
3.3 GPCRs
3.4 Pentameric Ligand-Gated Ion Channels
3.5 ABC Transporters
3.6 The Kinomes of Nematodes and Platyhelminths
3.7 Novel domain combinations, including those related to proteases
4. Metabolic reconstructions of nematode and platyhelminth parasites
4.1 Overall metabolic potential
  Number of EC numbers and enzymes
  Variability in KEGG pathway coverage
  Coverage of KEGG modules
4.2 Lipid metabolism
  β-oxidation
  Ketone bodies
4.3 The glyoxylate cycle: linking lipid and carbohydrate metabolism
4.4 Carbohydrate metabolism
  Scavenging carbohydrates from host food
  Malate dismutation
  Propionate shunt
  The pathway
  The GABA shunt
  Glycogen synthesis
4.5 Amino acid metabolism
  Glycine cleavage system
  Amino acid auxotrophies
4.6 Nucleotide metabolism
  synthesis
  synthesis
4.7 Metabolism of cofactors and vitamins
  Haem metabolism
  Vitamin metabolism
5. Identifying New of Anthelmintic Drug Targets and Drugs
5.1 Summary of approach
5.2 Ranking helminth proteins as likely drug targets
5.3 Choice of weightings in the scoring system for targets
5.4 Identifying potential new anthelmintic targets, and drugs
5.5 Diverse screening set
5.6 A high priority target set, giving a smaller diverse screening set
Supplementary Figures
Supplementary Tables
Additional References

Introduction

In this Supplementary Information we provide full details of the methods used for data production at three sequencing centres. In the Results section, we also provide data analysis and interpretation at a level of detail that is not possible in the main text. The structure of the document reflects that of the main text, so the underlying evidence and reasoning for each assertion in the main text can be explored in more detail. Contributions of individual members of the International Helminth Genomes Consortium are summarised (see Contributions below) and the authors that provided each of the species are identified in the ‘Methods: sample collection and preparation’ section. In addition, we have identified the key authors for each major analysis section.

Contributions from Consortium members

A.C., J.A.C., D.R.L., M.B., M.L.B., M.M., N.H., R.T. and R.M.M. wrote the manuscript; M.M. and M.B. conceived and directed the project; M.M., J.A.C., N.H., M.L.B. and M.B. planned and managed the project; N.H., H.B., A.T., J.M., K.H.P., P.O., Z.X. and D.W.T. were involved with genome sequencing; T.K., H.M.B., T.H., R.M.M., L.C.W., A.O., C.C., P.J.C., E.D., S.B., J.H., M.L.E., P.F.R., H.S., J.S.G., J.B.M., J.C., R.T., T.S., Ma.S., P.D.O., A.E., Da.R., F.A., E.C., M.K., K.S.E., P.H., J.M.H., J.F.U., D.E.H., D.Z., S.B., K.P., F.P., Y.H., T.C.B., M.G.S., K.A., J.A., B.M., V.N.T., D.W.T. and G.B. prepared or provided parasite material or nucleic acid; A.C., J.A.C., D.G., Di.R., E.S., A.T., J.L. and B.A.R. provided production ; E.S., A.C., B.H., B.A.R., J.M., K.H.P., P.O., Z.X., D.R.L., Ge.K., S.J., Ga.K. and K.L.H. annotated the genomes; I.J.T. assembled genomes at WTSI; J.M., K.H.P., P.O. and Z.X. assembled genomes at MGI; Ge.K., S.J., D.R.L. and Ga.K. assembled genomes at BaNG; T.K. analysed mitochondrial genomes; J.A.C. and A.C. analysed variation in genome size; A.J.R., My.S., J.L., A.C., D.R.L. and B.A.R. analysed GO terms and domains; J.A.C. developed visualisation software and produced the species tree; D.R.L. and M.L.B. analysed synapomorphies; A.C., N.H., J.A.C., S.R.D., Di.R., A.J.R., G.R., M.B., B.A.R. and R.T. analysed expanded gene families; A.J.R. and A.C. analysed hypothetical proteins; T.H.K., T.J.L., H.M.K. and I.J.T. analysed SCP/TAPS; N.D.R. analysed proteases; N.J.W., M.Z. and T.D. analysed GPCRs; R.N.B. and M.Z. analysed ion channels; B.A.R. and J.M. analysed kinases; R.T., S.S., J.P., A.C., J.A.C. and T.K. analysed metabolic pathways; P.M., A.R.L., J.A.C., J.L., A.C. and N.O.B. performed chemogenomics analyses.

Methods

1 Sample collection and preparation

Acanthocheilonema viteae (in collaboration with Kenneth Pfarr) Adult female A. viteae were isolated from the fat bodies of infected jirds, and adhering host material reduced by overnight incubation on a rhesus macaque fibroblast feeder cell monolayer culture. Total DNA was extracted using a standard Qiagen kit after homogenisation of the nematodes.

Ancylostoma caninum (in collaboration with John Hawdon) The sequenced strain (Baltimore) was collected near Baltimore in the 1960s by Gerhard Schad, and has been maintained in dogs (Canis familiaris) continuously since. Material for sequencing was obtained by John Hawdon, who has been maintaining the parasite at George Washington University since 2001. Worm isolation and extraction of nucleic acids was performed by Verena Gelmedin and others in the Hawdon lab using Qiagen kits. Voucher specimens are on deposit in the U.S. National Parasite Collection (accession number 100655).

Ancylostoma ceylanicum (in collaboration with John Hawdon) The strain was obtained from Ricardo Fujiwara at the Federal University of Minas Gerais, and maintained at the Hawdon laboratory in dogs and hamsters since 2007. Worm isolation and extraction of nucleic acids was performed by Verena Gelmedin and others in the Hawdon lab, or at MGI using Qiagen kits. Voucher specimens are on deposit in the U.S. National Parasite Collection (accession number 102954).

Ancylostoma duodenale (in collaboration with John Hawdon) Genomic DNA was isolated by John Hawdon, using Qiagen kits, and confirmed as A. duodenale by PCR before being sent for sequencing. No voucher has been deposited and no laboratories are known to be currently maintaining A. duodenale in .

Angiostrongylus cantonensis (in collaboration with Lian-Chen Wang) Genomic DNA was prepared from a single A. cantonensis male adult () using Genomic DNA Mini Kit (Geneaid Biotech). The worm was obtained from the lung of a Sprague Dawley rat.

Angiostrongylus costaricensis (in collaboration with Antonio Osuna, Teresa Cruz-Bustos and Mercedes Gomez Samblas) A. costaricensis (Costa Rica) were dissected from a sigmodon blood vessel in 2009 and stored in ethanol. A single female adult was used to prepare genomic DNA (Qiagen kit) which was sent to WTSI.

Anisakis simplex (in collaboration with Carmen Cuellar) Approximately 300 A. simplex third stage larvae were collected from viscera and flesh of blue whiting (Micromesistius poutassou) obtained from local markets. DNA preparation was based on a published protocol⁷³. DNA from 48 larvae was used to generate a short fragment Illumina library and DNA from 300 larvae was used to generate a 3 kb mate pair Illumina library (see Supplementary Table 22).

Ascaris lumbricoides (in collaboration with Philip Cooper) A. lumbricoides worms were obtained from young children in Ecuador with A. lumbricoides ova in stool samples. Children were treated with a single dose of 5 mg/kg of Combantrin ( and pamoate, ) and stool samples were collected for 24 hours after treatment. Expelled worms were washed thoroughly in sterile saline and stored in liquid nitrogen before being shipped on dry ice to WTSI. To enrich the sample for non-diminuted genomic material, testes were dissected from a single male adult and DNA was extracted using Qiagen Genomic-tip-20.

Brugia pahangi (in collaboration with Eileen Devaney) The strain of B. pahangi sequenced derives from the original B. pahangi isolated in 1955 from a cat in Malaysia. It was transported to the School of Tropical Medicine in Liverpool and maintained there in cats from the late 1950s to the late 1970s, and then maintained by serial passage through Aedes aegypti mosquitoes and gerbils. B. pahangi adult worms were collected from gerbils (Meriones unguiculatus) infected 4-5 months previously with 250 infective larvae (L3) by intraperitoneal injection. Approximately 500 adult worms were used to extract genomic DNA, using a standard phenol-chloroform extraction.

Brugia timori (in collaboration with Rick M. Maizels and Felix Partono) B. timori was obtained from a patient in Flores Island, and passaged in Meriones unguiculatus jirds and Aedes togoi mosquitoes by Felix Partono and Purnomo, University of Indonesia, Jakarta. An aliquot of 572 L3 harvested from mosquitoes fed with infected blood 14 days earlier and cryopreserved, were used for preparation of DNA (Gentra Puregene DNA extraction Tissue kit) in c. 1983.

Caenorhabditis elegans C. elegans (N2), originally obtained from Julie Ahringer at The , Cambridge, was maintained at WTSI. Genomic DNA was prepared from adult worms using Promega Wizard.

Cylicostephanus goldi (in collaboration with Jane Hodgkinson) A 20 worm sample of adult parasites was recovered from the large intestine of naturally infected of unknown geographical location. Parasites were washed in sterile PBS, heads were removed and stored in 10% formalin. The body of each worm was snap frozen in liquid nitrogen prior to transfer to -80°C. Species identification was based on morphological parameters. DNA was prepared using DNeasy Blood and Tissue Kit (Qiagen) and sent to WTSI.

Dibothriocephalus latus (in collaboration with Tomáš Scholz) A D. latus adult (specimen 7-P) was collected from a human host in Geneva, and stored at -20°C in 80% ethanol. Genomic DNA was extracted at WTSI using a proteinase K digest with beads and Gentip-100.

Dracunculus medinensis (in collaboration with Mark Eberhard) DNA was prepared (Gentra Puregene DNA extraction Tissue kit) from a single D. medinensis female adult collected upon emergence from a human patient in Ghana.

Echinostoma caproni (in collaboration with Rafael Toledo) Adult E. caproni Egyptian strain parasites were obtained from experimentally infected mice in the Department, Faculty of Pharmacy of the University of Valencia (Spain). Parasites were washed extensively in PBS and subsequently fixed in 70% ethanol. Genomic DNA was isolated according to the phenol-chloroform method.

Elaeophora elaphi (in collaboration with Antonio Osuna) E. elaphi adults were dissected in 2006 from the blood vessel of a wild red deer (Cervus elaphus) by Esther S. Hernández Redondo in Cordoba, Spain. A single adult worm (unknown sex) was used by Mercedes Gomez Samblas and Antonio Osuna to prepare genomic DNA (Qiagen kit).

Enterobius vermicularis (in collaboration with Pilar Foronda Rodríguez) Three adult worms were obtained from the peritoneal area of a child in Tenerife, Canary Islands. The parasites were stored in ethanol before genomic DNA was prepared with Fast DNA (BIO 101 System) kit.

Gongylonema pulchrum (in collaboration with Hiroshi Sato) Mucosa-embedded adults were isolated in bulk from a Holstein-Friesian breed in Hokkaido using fine forceps and frozen at -20°C. DNA was extracted from 2-3 adults using Promega Wizard.

Haemonchus placei (in collaboration with John Gilleard) Adult H. placei (strain MHpl1) worms were harvested from an experimentally infected lamb at the Moredun Institute, Scotland, by Frank Jackson and stored at -80°C. 50 male and 50 female adults were used by Elizabeth Redman to prepare genomic DNA following a standard phenol-chloroform protocol.

Heligmosomoides bakeri (in collaboration with Rick M. Maizels) Mixed sex, adult H. bakeri (also known as Heligmosomoides polygyrus bakeri) from a line originating from parasites supplied by Jerzhy Behnke of the University of Nottingham, were prepared as previously described⁷⁴. DNA was prepared with Gentra Puregene DNA extraction Tissue kit.

Hydatigera taeniaeformis (in collaboration with Pilar Foronda Rodríguez) Half of an adult H. taeniaeformis (formerly known as taeniaeformis) worm was isolated from the peritoneal cavity of a rat in Tenerife, Canary Islands. The parasite was stored in ethanol. Genomic DNA was prepared with Fast DNA (BIO 101 System) kit.

Hymenolepis diminuta (in collaboration with Peter Olson) The scolex and neck region from 4 adult H. diminuta () were used to prepare genomic DNA using Qiagen Genomic-tip 20/G. Specimens were derived from a laboratory strain previously maintained by Jørn Andreassen in Denmark that was seeded from the laboratory of C. Adrian Hopkins in Glasgow, who in turn received a seed culture from C.P. Read at Rice University. Genome data therefore represent the most widely employed laboratory isolate of the species.

Hymenolepis nana (in collaboration with Kazuhito Asano and Peter Olson) Adult H. nana parasites (Showa University, strain) were dissected by Kazuhito Asano from the small intestines of mice two weeks after oral inoculation of eggs. The parasite material was shipped by Akira Ito (Asahikawa Medical University) to the NHM, London where genomic DNA was prepared from multiple, whole adults using Qiagen Genomic-tip 20/G.

Litomosoides sigmodontis (in collaboration with Simon Babayan and Judith Allen) The University of Edinburgh laboratory isolate of L. sigmodontis was established from the original Museum d'Histoire Naturelle, Paris, cultures of Odile Bain. L. sigmodontis adult females (at 40 days post ) were recovered from the peritoneal cavities of infected jirds (Meriones unguiculatus) and cleaned by washing in RPMI. Individual healthy nematodes were flash frozen, ground in a mortar and pestle, and DNA extracted using a Qiagen kit. The genome assembly is based on DNA from two independent samples.

Mesocestoides corti (in collaboration with Estela Castillo) Mice infected with M. corti tetrathyridia (larval stage) were donated by Jenny Saldaña (Laboratorio de Experimentacion , Facultad de Quimica, Universidad de la Republica, Uruguay). Parasite removal and culture were performed as previously described⁷⁵. After 1 day of culture, 200-300 tetrathyridia were washed with PBS. DNA was purified by Estela Castillo and Marìa Fernanda Dominguez with DNeasy Blood & Tissue Kit (QIAGEN) and sent to WTSI.

Nippostrongylus brasiliensis (in collaboration with Rick M. Maizels) N. brasiliensis (mouse strain) adults (male and female) and a single male adult, from parasites originally supplied to Rick Maizels by William Gause of Rutgers-New Jersey School of Medicine and Dentistry, were prepared as described⁷⁴. DNA was prepared with Gentra Puregene DNA extraction Tissue kit.

Onchocerca flexuosa (in collaboration with Antonio Osuna, Teresa Cruz Bustos and M. Gomez Samblas) Hundreds of mixed O. flexuosa adults were obtained by Esther S. Hernández Redondo by dissection from a skin nodule of a wild red deer (Cervus elaphus) in Sierra Morena, Cordoba, Spain, in 2011 and stored in ethanol before DNA extraction (Qiagen).

Onchocerca ochengi (in collaboration with Benjamin Makepeace, Vincent N. Tanya, and Germanus Bah) Nodules were excised from skins of local Gudali Bos indicus obtained at the abattoir in Ngaoundéré, Cameroon. Nematodes liberated from the nodules were flash frozen and shipped to Liverpool, where DNA extraction was performed. Each extract was made from a single male nematode, and then checked for remaining bovine contamination. The specimen selected for sequencing had low (but significant) contamination from adherent and ingested bovine cells.

Parascaris equorum (in collaboration with Jacqui Matthews) P. equorum was obtained from a necropsied at an abattoir in Cheshire. A section of an individual adult worm was used to isolate genomic DNA using a Qiagen kit.

Protopolystoma xenopodis (in collaboration with Peter Olson) Genomic DNA was prepared from 8 mature and 19 immature P. xenopodis (South Africa) using Qiagen Genomic-tip 20/G. Specimens were collected from wild-caught Xenopus laevis in South Africa by Mathieu Badets.

Schistocephalus solidus (in collaboration with Martin Kalbe) An individual S. solidus (NST-G2 inbred laboratory strain), plerocercoid stage, was dissected under sterile conditions from a stickleback host. Genomic DNA was prepared by Irene E. Samonte using Qiagen Genomic-tip (100/G). The NST-G2 strain was obtained by within-family breeding of an out-crossed G1 family, which was produced with two worms caught in the wild at Neustädter Binnenwasser.

Schistosoma curassoni (in collaboration with Fiona Allan, David Rollinson and Aidan Emery) Isolate (NHM number 2525) was isolated from naturally infected in Dakar, Senegal. The sample used for sequencing was from the 14th passage of infection (15/6/1992), stored in liquid nitrogen and held within the SCAN repository (http://scan.myspecies.info/). Genomic DNA was recovered from an adult male worm using Qiagen DNeasy columns.

Schistosoma margrebowiei (in collaboration with Fiona Allan, David Rollinson and Aidan Emery) Isolate (NHM number 2991) was obtained from naturally infected antelope or snails in Zambia (1982). The sample used for sequencing was from the 58th passage of infection (26/10/1993), stored in liquid nitrogen and held within the SCAN repository (http://scan.myspecies.info/). Genomic DNA was recovered from an adult male worm using a standard SDS/Phenol chloroform extraction.

Schistosoma mattheei (in collaboration with Fiona Allan, David Rollinson and Aidan Emery) Isolate (NHM number 2767) was isolated from naturally infected snails (Bulinus sp.) from Denwood, Zambia (1991). The sample used for sequencing was from the 3rd passage of infection (17/9/1992) stored in liquid nitrogen and held within the SCAN repository (http://scan.myspecies.info/). Genomic DNA was recovered from an individual adult male worm using Qiagen DNeasy columns.

Schistosoma rodhaini (in collaboration with Fiona Allan, David Rollinson and Aidan Emery) Isolate (NHM isolate number 4039) was originally isolated from Burundi (either rodent or snails). The sample used for sequencing was from the of mice from the 6th passage of infection (3/10/2002), stored in liquid nitrogen and held within the SCAN repository (http://scan.myspecies.info/). Genomic DNA was recovered from a single adult male worm using standard SDS/Phenol chloroform extraction.

Soboliphyme baturini (in collaboration with Joseph Cook) Genomic DNA was obtained from an adult S. baturini nematode that was taken from the stomach of a Pine Marten (Martes americana, specimen MSB222066, Museum of Southwestern Biology, Dall Island, Alaska http://arctos.database.museum/guid/MSB:Mamm:222066), in collaboration with Anson Koehler.

Strongylus vulgaris (in collaboration with Jane Hodgkinson) Thirteen adult specimens of S. vulgaris were collected at necropsy from the large intestine of a naturally infected horse not exposed to for over three decades, from within a closed herd at the University of Kentucky, USA (Barn 10; Courtesy of Dr Martin Nielsen, at the University of Kentucky). Anterior ends were severed, mounted on glass slides, cleared with phenol/alcohol and identified to species based on morphological criteria. Male worms were used for DNA extraction. All specimens were stored in 70% ethanol at −20 °C. DNA was prepared using DNeasy Blood and Tissue Kit (Qiagen).

Syphacia muris (in collaboration with Rafael Toledo) Adult S. muris parasites were collected from experimentally infected rats from the animal holding of the Faculty of Pharmacy of the University of Valencia, Spain. The adult parasites were washed extensively in PBS and fixed in 70% ethanol. Genomic DNA was extracted using the phenol-chloroform method.

Taenia asiatica (in collaboration with Keeseon Eom) A single T. asiatica () adult worm was collected from a human patient. Genomic DNA was prepared from fragments of gravid proglottids using Qiagen Genomic-tip 20/G.

Teladorsagia circumcincta (in collaboration with Stewart Bisset) The strain sequenced was developed by subjecting an anthelmintic-susceptible laboratory isolate, originally derived from field-grazed lambs in New Zealand in the 1950s, to two generations of inbreeding (sibling mating) in order to minimize genetic diversity. DNA was isolated from multiple adult worms. The assembly and annotation version for the genome used in the present study is an earlier version of the one submitted to GenBank (Accession: PRJNA219637)⁷⁶.

Thelazia callipaeda (in collaboration with Manuela Schnyder) University of Zurich, Switzerland. Adult T. callipaeda, collected from a dog host in Ticino⁷⁷, were stored in ethanol and shipped to WTSI. Genomic DNA was isolated from one adult worm using Promega Wizard genomic DNA kit.

Toxocara canis (in collaboration with Philip Cooper) 30-45 day-old puppies were treated with (180 mg/kg weight) followed by milk ad libitum, and adult T. canis worms were collected over the following 24 hours. Adult worms were washed profusely with sterile saline, dried, sexed, and then stored at -80℃. Genomic DNA was isolated from a single adult male worm using Qiagen Genomic-tip-20.

Trichinella nativa (in collaboration with Dante Zarlenga) T. nativa isolate ISS45 (International Reference Center; http://www.iss.it/site/trichinella/) was passaged in Swiss Webster mice approximately one to two times per year since 1986. Genomic and transcriptomic DNA were isolated and confirmed to be T. nativa by PCR.

Trichobilharzia regenti (in collaboration with Petr Horak) T. regenti cercariae from a mono-miracidial infection of Radix lagotis were collected, fixed in 96% ethyl-alcohol and stored at -80°C before being shipped to WTSI, where genomic DNA was extracted using the Promega Wizard kit.

Trichuris suis (in collaboration with Joseph F. Urban and Dolores E. Hill) A T. suis strain⁷⁸ was used that has been actively passed in pigs, one to two times per year since the early 1990s at the USDA in Beltsville, MD. The strain was originally derived from pigs naturally infected on dirt lots. Adult worms were isolated from pigs placed on the contaminated land and naturally infected or experimentally inoculated with eggs developed in vitro. The T. suis adults were manually removed from the cecum and proximal colon tissue and cultured in vitro to release fertilized eggs that were removed after 24-48 hours and embryonated to an infective stage.

Wuchereria bancrofti (in collaboration with Rick M. Maizels and Felix Partono) W. bancrofti L3 (Jakarta isolate), were collected from a single batch of Aedes togoi mosquitoes fed 16 days earlier on blood from an infected patient by Felix Partono of the Department of Parasitology, University of Indonesia, Jakarta in 1981 and stored at -20 ℃. DNA was prepared with Gentra Puregene DNA extraction Tissue kit.

2.1 Data production, Wellcome Trust Sanger Institute (WTSI)

The genomes of 36 species were sequenced at the Wellcome Trust Sanger Institute (WTSI) (Supplementary Tables 1 and 22).

DNA template preparation and sequencing

Where genomic DNA (gDNA) of sufficient yield and quality was available, a PCR-free short insert library and a 3 kb large insert mate pair library were both made. Amplification-free 400-550 bp paired end Illumina libraries were prepared from <0.1 ng to 5 µg gDNA using a previously described method⁷⁹, except that Agencourt AMPure XP beads were used for sample clean up and size selection. Genomic DNA was precipitated onto beads after each enzymatic stage with an equal volume of 20% Polyethylene Glycol 6000 and 2.5 M sodium chloride solution. Beads were not separated from the sample throughout the process until after the adapter ligation stage: fresh beads were then used for size selection. Where there was insufficient DNA to produce a PCR-free library, adapter-ligated material was subjected to the minimum number of PCR cycles possible to obtain a satisfactory outcome (usually 8 cycles), employing indexed oligonucleotide primers and Kapa HiFi polymerase. Between 1 µg and 10 µg of genomic or WGA DNA was used to generate 3 kb mate pair libraries using a modified SOLiD 5500 protocol adapted for Illumina sequencing⁸⁰. When insufficient gDNA was available for 3 kb mate pair libraries, whole genome amplification (WGA) was performed using GenomiPhi v2 (GE life science) in a 20 µl reaction mixture at 30°C for 90 min followed by alkaline denaturation according to the manufacturer's instructions. Amplified products were purified using a QIAAmp DNA mini kit (Qiagen) and used as input for mate pair library preparation. Each library was run on at least one HiSeq 2000 lane using proprietary reagents according to the manufacturer's recommended protocol (https://icom.illumina.com/). Data from the Illumina HiSeq sequencing machines were analysed using the RTA1.8 analysis pipelines.

Genome assembly

A standardised workflow was developed to assemble 36 draft genomes, plus resequencing data for C. elegans N2 (Supplementary Fig. 15a) as an assembly control. Short insert paired-end sequence reads were corrected and initially assembled with SGA assembler v0.9.7⁸¹. This draft assembly was used to calculate the distribution of k-mers for all odd values of k between 41 and 81, using GenomeTools v.1.3.7⁸². The k-mer length for which the maximum number of unique k-mers was present in the SGA assembly was then used as the k-mer setting in a second assembly using Velvet v1.2.03⁸³ with the corrected reads from SGA. For species with 3 kb mate-pair sequence data, the Velvet assembly was scaffolded using SSPACE⁸⁴. Contigs were extended, and gaps were closed or shortened, using Gapfiller⁸⁵ and then IMAGE⁸⁶. The short fragment reads were remapped to the assembly using SMALT (www.sanger.ac.uk/science/tools/smalt-0), and a ‘bin’ assembly using the unaligned reads was generated using Velvet⁸³ as described above, and incorporated into the main assembly. The merged assembly was re-scaffolded using SSPACE⁸⁴. The consensus base quality of the assembly was improved with iCORN⁸⁷. REAPR⁸⁸ was run to detect and break apart incorrectly assembled scaffolds and contigs, thereby increasing their accuracy and preventing inflated assembly metrics (e.g. N50) from being reported. We carried out minimal manual improvement for two species ( and medinensis), by using Gap5⁸⁹ to manually extend and link scaffolds using Illumina read pairs. The genome assemblies were screened to remove likely contamination with non-target material (see section Contamination screening below) and then used for gene finding and subsequent analysis.
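The k-mer selection step described above can be illustrated with a minimal Python sketch. This is not the GenomeTools-based procedure used for the assemblies: the function names are hypothetical, "unique k-mers" is assumed here to mean distinct k-mers, reverse complements are not merged, and the whole draft assembly is assumed to fit in memory.

```python
def count_distinct_kmers(contigs, k):
    """Count distinct substrings of length k across all contigs."""
    seen = set()
    for seq in contigs:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return len(seen)


def choose_k(contigs, k_min=41, k_max=81):
    """Return (best_k, counts) over all odd k in [k_min, k_max].

    best_k is the odd k with the most distinct k-mers in the draft
    assembly. Ties go to the smaller k (an arbitrary choice made for
    this sketch, not something stated in the text).
    """
    candidates = [k for k in range(k_min, k_max + 1) if k % 2 == 1]
    counts = {k: count_distinct_kmers(contigs, k) for k in candidates}
    best_k = max(candidates, key=lambda k: counts[k])
    return best_k, counts
```

On real data, `contigs` would hold the sequences of the SGA draft assembly, and the returned `best_k` would be passed to the second assembler as its k-mer setting.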

Assembly QC

Resequencing and assembly of the N2 strain

To check for misassemblies in our draft C. elegans assembly, it was aligned to the WormBase reference genome (WS235) using nucmer90. This gave 42,171 alignment blocks that covered 98.4 Mb of the 101.2 Mb new assembly. Since the new assembly was from the same strain as the reference genome, the alignment blocks were filtered to retain only those with ≥99.9% identity, and to keep only the best alignment for each region of a scaffold in the new assembly (delta-filter -q option). The resulting 5,283 stringently aligned blocks covered 81.2 Mb and ranged from 65 bp to 386 kb in size. Thirty-seven alignment blocks (ranging from 65–95,240 bp but with 24 blocks shorter than 5 kb) overlapped by ≥90% with an adjacent longer block and were discarded, leaving 5,246 alignment blocks that covered 81.2 Mb of the 101.2 Mb new assembly.

To identify interchromosomal misassemblies, the alignment blocks were ordered with respect to their positions along the scaffolds of our new assembly. Considering alignment blocks ≥50 kb (453 alignment blocks, which covered 40.3 Mb of our assembly), there were no cases where one scaffold in the draft assembly aligned to two different chromosomes in the reference assembly. Thus, amongst the long scaffolds there are no obvious interchromosomal misassemblies. However, 254 interchromosomal rearrangement breakpoints were found using scaffolds <50 kb.

To find intrachromosomal rearrangement breakpoints, each scaffold with ≥3 alignment blocks to reference chromosomes was considered (since two alignment blocks can only be in the same order in our assembly as in the reference). To find intrachromosomal rearrangements between a particular scaffold and reference chromosome, alignment blocks were numbered along the scaffold (e.g. 1 2 3 4 5) and breakpoints found between stretches of consecutively numbered blocks along the reference chromosome (e.g. two breakpoints: 1 | 5 4 | 2 3). This approach did not consider the orientation of alignment blocks, so inversions spanning just a single block were not detected. However, this method provided an estimate of 93 intrachromosomal rearrangement breakpoints.
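The breakpoint count for a single scaffold can be sketched as follows. The sketch assumes that a run of consecutive numbers counts as unbroken in either direction, as the worked example (where ‘5 4’ is one stretch) implies; the function name is hypothetical.

```python
def count_breakpoints(block_order):
    """Count breakpoints for one scaffold against one reference chromosome.

    block_order lists the scaffold's block numbers in the order in which the
    blocks occur along the reference chromosome. Neighbouring blocks whose
    numbers differ by exactly 1 belong to the same stretch; any other
    adjacency is a breakpoint. Orientation is ignored.
    """
    return sum(1 for a, b in zip(block_order, block_order[1:]) if abs(a - b) != 1)


# Worked example from the text: 1 | 5 4 | 2 3
example_breakpoints = count_breakpoints([1, 5, 4, 2, 3])
```

Under this rule a scaffold with only two blocks can never show a breakpoint, which is consistent with the requirement for ≥3 alignment blocks.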

Assembly completeness

To assess the completeness of the 36 assemblies, CEGMA v2.491 was run using default settings. For some phylogenetic groups, consistent sets of genes were missing despite otherwise very complete assemblies. These included 6 genes absent from Trichinelloidea (Trichinella and species; KOG1047, KOG2803, KOG1291, KOG2948, KOG0756, KOG3285), 5 genes absent from all (KOG1468, KOG2555, KOG2303, KOG1185, KOG2770; as previously reported for 4 of these 5, ref. 92), 14 genes absent from both cestodes and trematodes (KOG0622, KOG1185, KOG0602, KOG2531, KOG1755, KOG0567, KOG2785, KOG0047, KOG1390, KOG1430, KOG2555, KOG1568, KOG0344, KOG1468), 6 additional genes missing only from cestodes (KOG3164, KOG0650, KOG1068, KOG2311, KOG3180, KOG1535) and 7 additional genes missing only from trematodes (KOG1562, KOG0313, KOG1980, KOG3237, KOG1539, KOG0788, KOG1123). These gene sets were discounted from the completeness calculation for their respective species, to reflect that in these lineages the genes are either lost or have diverged to the point that they cannot be identified by the standard CEGMA analysis.
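One way to express the discounting is sketched below. It assumes that completeness is the percentage of expected core genes that were found, with lineage-absent genes removed from both numerator and denominator; the exact formula is not given above, and the function name and toy KOG labels are hypothetical.

```python
def adjusted_completeness(found, core_genes, lineage_absent):
    """Percentage of expected core genes that were found in an assembly.

    Genes listed in lineage_absent are dropped from the expected set, so
    they count neither for nor against the assembly.
    """
    expected = set(core_genes) - set(lineage_absent)
    if not expected:
        raise ValueError("no core genes left after discounting")
    return 100.0 * len(expected & set(found)) / len(expected)


# Toy example: five core genes, one of which is absent from the lineage.
core = ["KOG_A", "KOG_B", "KOG_C", "KOG_D", "KOG_E"]
toy_score = adjusted_completeness(["KOG_A", "KOG_B", "KOG_C"], core, ["KOG_E"])
```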

Effect of repeats on assembly size

To investigate whether repeats have been collapsed in the assemblies of the 36 species sequenced at WTSI, we re-mapped the reads from the short-insert library to the appropriate assembly using SMALT (H. Ponstingl, http://www.sanger.ac.uk/science/tools/smalt-0; using indexing options -k13 -s4 and mapping options -y 0.9 -x -r 1). For H. bakeri, two short-insert libraries were available, and only the library with the largest number of mapped reads was used here. Scaffolds smaller than 8 kb were discarded due to excessive noise in read depth estimates. As previously described53, median per-base read depths meds for each scaffold were calculated using the BEDTools93 function genomecov and, from these, the genome-wide read depth medg was calculated as the median of medians.

Collapsed repeats are regions where near-identical repeat units from the genome are represented by a smaller number of repeats in the assembly but at artefactually high sequence depth. To detect collapsed repeats, the mean (ms) and median (meds) per-base read depths were compared as follows. For a scaffold of length ls bp, the amount of extra sequence (es) that would be gained by ‘uncollapsing’ repeats was estimated as (ms - medg)*ls/medg. Scaffolds with unusually high read depth were identified as those that had mean coverage ≥1.2-fold of the median coverage across the genome (i.e. ms ≥1.2*medg), and had ≥10 kb of extra sequence (es ≥10 kb). The total amount of extra sequence e that would be gained by ‘uncollapsing’ repeats was estimated as the sum of the extra sequence es for all such scaffolds of ≥8 kb. If the original assembly size for a species was a, the new estimate for the assembly size was e+a (Supplementary Table 5).
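The calculation above can be written as a short sketch. The per-scaffold depths are assumed to have been computed already (for example from BEDTools genomecov output); the function name and the tuple layout are hypothetical.

```python
from statistics import median


def uncollapsed_assembly_size(scaffolds, assembly_size,
                              min_len=8_000, fold=1.2, min_extra=10_000):
    """Estimate assembly size after 'uncollapsing' repeats (e + a).

    scaffolds: iterable of (length_bp, mean_depth, median_depth) tuples.
    Scaffolds shorter than min_len are ignored throughout.
    """
    kept = [s for s in scaffolds if s[0] >= min_len]
    med_g = median(s[2] for s in kept)  # genome-wide depth: median of medians
    extra_total = 0.0
    for length, mean_depth, _median_depth in kept:
        extra = (mean_depth - med_g) * length / med_g  # e_s
        if mean_depth >= fold * med_g and extra >= min_extra:
            extra_total += extra
    return assembly_size + extra_total


# Toy example: one 50 kb scaffold at twice the genome-wide depth.
toy_scaffolds = [(100_000, 30, 30), (50_000, 60, 30), (20_000, 31, 30), (5_000, 300, 300)]
toy_estimate = uncollapsed_assembly_size(toy_scaffolds, assembly_size=175_000)
```

In the toy example only the 50 kb scaffold passes both thresholds, contributing (60 - 30)*50,000/30 = 50,000 bp of extra sequence.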

Gene prediction

Gene predictions were generated using MAKER version 2.2.2894. The MAKER annotation pipeline consisted of four steps, taking into account evidence from multiple sources (Supplementary Fig. 16a). First, repetitive elements in each genome were identified and masked using RepeatMasker (www.repeatmasker.org) by scanning scaffolds for matches to repeats from a repeat library generated using RepeatModeler (www.repeatmasker.org/RepeatModeler.html). Second, ab initio gene models to be used as evidence within MAKER were generated using Augustus 2.5.595, GeneMark-ES 2.3a (self-trained)96, and SNAP 2013-02-1697. Further gene models used as MAKER input were generated using the comparative algorithms genBlastG98 (which used comparisons to C. elegans gene models from WormBase99) and RATT100 (which transferred gene models from the taxonomically nearest published ‘reference’ genome from the list: Haemonchus contortus for clade V parasites; suum for Ascaridomorpha; (and ) for group IIIc (Spiruromorpha+other); Trichuris muris for clade I; Strongyloides ratti for clade IV; for cestodes except multilocularis for Taenia species; for trematodes). Third, species-specific ESTs and cDNAs from INSDC101, and proteins from related species (see below), were aligned against the genomes using BLASTN and BLASTX102, respectively, and these alignments were further refined with respect to splice sites using exonerate103. Last, the EST and protein homology alignments, comparative gene models, and ab initio gene predictions were integrated and filtered by MAKER to produce a gene set for each species, with just one transcript for each gene.

The four-step MAKER pipeline was run three consecutive times. The first run was performed using the est2genome option with species-specific ESTs and cDNAs, and the protein2genome option with nematode protein sequences from UniProt’s UniRef90 clusters104. For this first MAKER run, Augustus and SNAP were trained using CEGMA105 gene models for KOGs, as well as ‘nematode orthologous groups’ (NOGs), ‘trematode orthologous groups’ (TROGs), or ‘cestode orthologous groups’ (CEOGs) as appropriate. The NOGs106 and TROGs were identified using OrthoMCL107 to cluster proteins from the full proteomes of 12 nematode species, four trematode species, one cestode and several eukaryotic outgroups. NOGs were defined as clusters containing at least one member from each of the 12 nematodes, and TROGs as those containing at least one member from each of the four trematodes. Since the original OrthoMCL clustering only included one cestode, a separate clustering was performed to define CEOGs, using four cestode species and several eukaryotic outgroups. An HMM was built for each cluster using HMMER (http://hmmer.org). Gene models obtained from the first MAKER run were used to train SNAP, and MAKER was run a second time, using the same nematode proteins as in the first run. Gene models from the second run were then used to train Augustus. Using the trained versions of SNAP and Augustus, MAKER was run a third time, using a taxonomically broader protein set that included proteins from metazoans with complete proteomes from UniProt and a subset of proteins from helminths from GeneDB108.

The resulting MAKER gene set was filtered to remove less reliable gene models, as follows. Firstly, any MAKER gene models that were based on exonerate or BLASTX alignments, and did not overlap any Augustus, genBlastG or RATT gene model, were discarded, as they were probably due to spurious alignments. Secondly, MAKER gene models that encoded proteins shorter than 30 amino acids were discarded. Thirdly, if two different MAKER gene models overlapped in their coding sequence, the gene model with the worse MAKER score (i.e. AED score) was discarded.
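The three post-MAKER filters can be sketched as below. The record layout (dictionary keys) is hypothetical, each gene's coding sequence is simplified to a single interval, strand is ignored, and overlaps are resolved greedily from the best (lowest) AED downwards; the original procedure may have handled chains of overlapping models differently.

```python
def filter_gene_models(models):
    """Apply the three filters to a list of gene-model dictionaries.

    Hypothetical keys: id, scaffold, cds_start, cds_end, aed,
    protein_length, alignment_based (model derived from exonerate/BLASTX
    alignments), overlaps_predictor (overlaps an Augustus, genBlastG or
    RATT model).
    """
    # Filters 1 and 2: unsupported alignment-based models and short proteins.
    kept = [
        m for m in models
        if not (m["alignment_based"] and not m["overlaps_predictor"])
        and m["protein_length"] >= 30
    ]
    # Filter 3: among overlapping models keep the one with the lower AED
    # (AED 0 means perfect agreement with the evidence).
    kept.sort(key=lambda m: m["aed"])
    final = []
    for m in kept:
        clash = any(
            m["scaffold"] == f["scaffold"]
            and m["cds_start"] <= f["cds_end"]
            and f["cds_start"] <= m["cds_end"]
            for f in final
        )
        if not clash:
            final.append(m)
    return final
```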

Contamination screening

For some of the species sequenced, sequencing reads were contaminated with those of other species, either arising from DNA of the host species, other species that are commensal in the host, or from laboratory contamination. To remove contaminant scaffolds from the assemblies, a multi-step approach was taken:

Step 1. Each scaffold was split into 50 kb chunks, and for each chunk BLASTX was run against databases of (i) all invertebrate proteins from GenBank, and (ii) proteins from representative species from major non-invertebrate taxa (bacteria, , fungi, plants, etc.). For a particular chunk, if the e-value for its top non-invertebrate hit was 1E+10 fold lower than the e-value of its top invertebrate hit (e.g. E-60 versus E-50), the chunk was considered to be contaminant. If more than half of the chunks of a scaffold were classified as contaminant, the whole scaffold was considered contaminant and was removed from the assembly.

Step 2. BLASTP searches of predicted proteins from (non-contaminant) scaffolds remaining after step 1 were run against the search databases. For each protein, if its top BLASTP hit was to a non-invertebrate protein, and had an e-value that was 1E+50 times lower than that of the best invertebrate hit, then the gene was considered a putative contaminant gene. Conversely, if the top hit was to an invertebrate protein, and its e-value was 1E+50 times lower than that of the best non-invertebrate hit, the gene was classified as non-contaminant. If more than half of the classified genes on a scaffold were considered contaminant, then the scaffold was classified as contaminant and removed from the assembly.

Step 3. A more stringent version of the second step removed contamination originating from other invertebrates (for example, platyhelminth contamination in a nematode assembly), as well as any residual contamination from non-invertebrates (e.g. bacteria) not removed by the first two steps. From the non-contaminant scaffolds that remained after step 2, protein sequences were BLASTP-searched against a database of the non-invertebrate proteins from steps 1 and 2, plus either nematode or platyhelminth protein sequences. For each platyhelminth query gene, the top five BLASTP hits in the nematode/non-invertebrate database and in the platyhelminth database were recorded. If the top five of these ten BLAST hits were to nematodes/non-invertebrates, and the e-value of the worst nematode/non-invertebrate hit was at least 5 orders of magnitude lower than the e-value of the best platyhelminth hit, it was considered to be a contaminant. Conversely, if the top five of the ten hits were to platyhelminths, and the e-value of the worst platyhelminth hit was 5 orders of magnitude lower than that of the best nematode/non-invertebrate hit, it was considered a non-contaminant. If a scaffold had one or more contaminant genes, and no non-contaminant genes, it was considered to be a contaminant scaffold and removed.

Step 4. (Anisakis simplex only) For Anisakis simplex, the assembly produced by step 3 appeared to contain a low level of contamination by Schistosoma mansoni sequence (1-2% of bases), probably due to laboratory contamination, as some scaffolds appeared to consist of S. mansoni repetitive (e.g. transposable element) or mitochondrial sequence. An early version of the Compara database was analysed (see Compara Database of Gene Families below) to identify A. simplex genes that belong to gene families, and which lie in the expected position of the gene tree according to the species tree. These genes were considered to be putative non-contaminant (A. simplex) genes. If a scaffold contained no non-contaminant genes, it was considered a putative contaminant scaffold. All putative contaminant scaffolds were BLASTN-searched against the S. mansoni and Caenorhabditis elegans genomes. If the top hit of a scaffold was to S. mansoni, and its e-value was at least 1E+20 times lower than the top C. elegans hit, the scaffold was considered to be contaminant and removed from the assembly.

For most species, <5% of the assembly was classified as contaminant, but seven species had more contamination: (11.6%), Elaeophora elaphi (10.3%), Heligmosomoides bakeri (7.4%), Anisakis simplex (7.3%), Cylicostephanus goldi (6.0%), Syphacia muris (5.5%), and Wuchereria bancrofti (4.6%). These assemblies may still contain some small contaminant scaffolds, which are hard to detect. In particular, the contamination scan pipelines were largely based on finding contaminant genes, so may have missed contaminant scaffolds that lacked coding sequences.

If the contaminant scaffolds comprised ≥5% of bases in the original assembly, gene-finding was re-run from scratch for that species, in case the contaminant scaffolds had affected the training of gene-finders used to create input gene sets for MAKER.
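Step 1 of the contamination screen can be sketched as follows. The handling of chunks that lack a hit in one of the two databases is not described above; in this sketch such chunks are simply left unclassified, which is an assumption, as are the function names.

```python
def chunk_is_contaminant(noninvert_evalue, invert_evalue, fold=1e10):
    """Classify one 50 kb chunk from its top BLASTX hits.

    The chunk is contaminant when the top non-invertebrate e-value is at
    least `fold` times lower than the top invertebrate e-value.
    Assumption: a chunk missing either hit (None) is not called contaminant.
    """
    if noninvert_evalue is None or invert_evalue is None:
        return False
    return noninvert_evalue * fold <= invert_evalue


def scaffold_is_contaminant(chunk_hits, fold=1e10):
    """chunk_hits: list of (noninvert_evalue, invert_evalue), one per chunk.

    The scaffold is removed when more than half of its chunks are contaminant.
    """
    flagged = sum(chunk_is_contaminant(n, i, fold) for n, i in chunk_hits)
    return flagged > len(chunk_hits) / 2
```

The same comparison with fold=1e50 (step 2) or fold=1e20 (step 4) expresses the thresholds used in the later steps, although those steps classify genes or whole scaffolds rather than chunks.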

2.2 Data production, McDonnell Genome Institute (MGI)

The genomes of six species were sequenced and assembled at McDonnell Genome Institute (MGI) (Supplementary Tables 1 and 22).

Genome sequencing library preparation

Paired end short insert libraries

454 titanium fragment libraries were constructed with 5-10 μg of DNA according to the manufacturer's recommendations (Roche 454). Illumina small-insert paired-end libraries were prepared according to the manufacturer's protocol, except that multiple library enrichment reactions and size selection were performed after amplification and multiple size fractions (300-400 and 400-500 bp) were collected.

454/Illumina 3 kb insert mate pair libraries

DNA was sheared into 3 kb fragments, blunt ended and ligated to the SOLiD Mate-Pair Cap Adapter (ABI). Ligated DNA was size fractionated to 2-4 kb fractions and then purified. Circularization reactions were set up using 1 μg of the extracted fraction and 1.3 pmol of the Internal SOLiD Mate Pair adaptor (ABI). Linear (or non-circularized and nicked) fragments were removed, and circularized fragments were nick-translated, extending from gaps engineered within the cap adapter, using 200 ng of library and 20 units of DNA polymerase I. Nick-translation reactions were treated with S1 nuclease for 15 minutes. Resulting products were blunt ended and immobilized using Dynal M270 Streptavidin beads (Invitrogen). For 454 sequencing, FLX Titanium paired-end library adaptors were ligated onto the immobilized DNA fragments and processed as recommended by the manufacturer's 3 kb span paired end library construction protocol (Roche 454). For Illumina sequencing, blunt ended fragments were processed through an adenylation reaction. Illumina’s TruSeq adaptors were ligated, the library was enriched with KAPA HiFi polymerase (KAPA Biosystems) and a final dual SPRI size selection was performed to isolate 300-500 bp library fragments.

454/Illumina 8 kb insert mate pair libraries

8 kb span paired-end libraries were constructed for 454 sequencing according to the manufacturer’s recommendations (Roche 454), except that the 6.5-9 kb fraction was extracted from the size selection gel and the extracted adaptor ligated DNA was purified using a Qiagen Gel Extraction Kit. For Illumina sequencing, 15 μg of high molecular weight DNA was sheared to a mean fragment size of 8 kb with a Hydroshear, blunt ended using DNA Terminator End Repair Kit (Lucigen) and ligated with 20 µM Circularization Adapters (Roche). The ligated DNA was size-fractionated and the 6.5-10 kb fraction was purified using the Qiagen Gel Extraction Kit. 300 ng of size selected DNA was circularized using 10U of Cre Recombinase. Linear (or non-circularized and nicked) library fragments were removed. The circularized library fragments were fragmented targeting a mean insert size of 300 bp. The fragmented DNA was blunt ended using the DNA Terminator End Repair Kit (Lucigen), processed through an adenylylation reaction (NEB) and Illumina's TruSeq adaptors were ligated. The adenylated fragments were immobilized with Dynal M270 Streptavidin beads and amplified with KAPA HiFi Polymerase (KAPA Biosystems). The final 300-500 bp library fragments were selected with a dual SPRI reaction.

Genomes sequenced on the Roche/454 platform were assembled from a combination of fragment reads, 3 kb paired-end reads and 8 kb paired-end reads generated to meet the coverage criteria of 15x, 15x and 3x respectively, with a target of 30x coverage for the final assembly. Genomes sequenced on the Illumina platform had overlapping fragment reads, 3 kb and 8 kb paired-end reads and were sequenced to a depth of 45x, 45x, and 10x, respectively.

Genome assembly

Assemblies were generated using the assembly workflows outlined in Supplementary Fig. 15b, with the specific method depending on the input material. Information on the specific workflow that applies to each assembly can be found in Supplementary Table 22. Assembly information is detailed in Supplementary Table 1.

454 data

A combination of 3 kb, 8 kb and fragment 454 reads were subjected to adapter removal, quality trimming and length filtering using a combination of the Flexbar109 and Trimmomatic110 tools (Supplementary Fig. 15b). Contaminant screening was done using the Bowtie2 aligner and a local contaminant database containing ribosomal RNA, bacteria and host sequence. The cleaned reads were then assembled using the Newbler assembler111 before being scaffolded with an in-house tool, CIGA, which links contigs based on cDNA evidence. The resulting assembly was improved using another local tool, Pygap, which uses Illumina short paired end sequences to help fill gaps between scaffolded contigs. Finally, L_RNA_scaffolder112 used 454 cDNA data to further improve scaffolding.

Illumina data

3 kb, 8 kb and fragment Illumina sequences were subjected to the adapter removal, quality trimming, length filtering and contamination screening process described for the 454 data above (Supplementary Fig. 15b). The cleaned reads were then assembled using the AllPaths-LG assembler113 before being improved using the Pygap and L_RNA_scaffolder tools as described above.

Trichinella nativa assisted assembly

The Trichinella nativa genome was assembled with an assisted assembly approach. Illumina 3 kb paired end sequence data were subjected to the same 'cleaning' procedure described above and were then mapped against the Trichinella spiralis reference (ABIR00000000) using the bwa aligner114 with default parameters (Supplementary Fig. 15b). T. nativa data aligned to 82.1% of the reference genome to a depth of ~126.6x. Samtools mpileup was run on the alignment along with vcfutils.pl varFilter using suggested argument settings to identify the differences between the T. nativa reads and the T. spiralis backbone. 1,053,763 SNPs were identified along the region of the T. spiralis reference matching T. nativa. We then subtracted out the T. spiralis reference from the mapped region by omitting all SNP loci where the alternate allele frequency was 1, leaving us with 169,259 SNP loci within our putative T. nativa sequence. We then used the FastaAlternateReferenceMaker method of GATK (http://software.broadinstitute.org/gatk/) and a bed file comprising only regions where T. nativa mapped to the T. spiralis reference to construct a T. nativa consensus populated with the T. nativa allele at each detected SNP. In addition, reads that did not initially map to the T. spiralis reference were assembled using Velvet83, with a k-mer size of 39 chosen by VelvetOptimiser (http://bioinformatics.net.au/software.velvetoptimiser.shtml). BLAT115 was then used to compare the contigs created by Velvet to the contigs created by alignment to the T. spiralis reference, and all Velvet contigs greater than 500 bp that mapped less than 50% of their length (and at >80% identity) to an existing contig were added to the assembly.

Assembly QC / Contamination screening

All assemblies were screened for contamination before annotation. Adaptor sequences and contaminants were identified by comparing contigs to a database of vectors, bacterial and host contaminants using Megablast. High-scoring segment pairs (HSPs) with E-value <0.01 and length >100 bp were picked. The final alignment length was taken from the first base of the first HSP to the last base of the last HSP. The contig was removed if the identity was greater than 75% and the coverage was greater than 40% of the contig, or the contig was less than 2000 bp. Any contigs which were on the border of the requirements and longer in length were manually reviewed as an extra measure against true genome contigs being removed.

Transcriptome sequencing and assembly

Transcriptome (RNAseq) libraries (Supplementary Table 23) were generated with the Illumina TS stranded protocol following the manufacturer’s guidelines. Raw reads were cleaned using an in-house tool that trims adaptors, quality trims and applies a length filter using Flexbar109 and Trimmomatic110. Low complexity sequence was masked using the filter_by_complexity tool in the seq_crumbs package (http://bioinf.comav.upv.es/seq_crumbs/), and contaminating sequences were identified using Bowtie2116 and TopHat2117 before being removed using local code. The cleaned, filtered RNAseq reads were assembled de novo with Trinity118, using both left and right cleaned paired reads. The output was filtered for the longest representative open reading frame, resulting in a ‘best candidates’ file. Transcripts were merged using CD-HIT119 with 98% coverage and identity. The assembled contigs were assessed for quality by aligning the reads (with TopHat117) back to the reference assembly to establish the percentage of the reference aligned to by the reads and the percentage of reads that aligned to the reference. The assembled RNASeq data were used alongside EST data in the MAKER stage of gene prediction.

Gene prediction

For each assembly a repeat library was generated using RepeatModeler. Ribosomal RNA genes were identified using RNAmmer (http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?rnammer) and transfer RNAs were identified with tRNAscan-SE120. Non-coding RNAs, such as microRNAs, were identified by searching against Rfam121. Repeats and predicted RNAs were then masked using RepeatMasker. Protein-coding genes were predicted using a combination of several ab initio programs: SNAP97, FGENESH (Softberry, Corp), Augustus95 and the MAKER pipeline122, which aligns mRNA, EST and protein evidence from the same species or cross-species to aid in gene structure determination and modifications (Supplementary Fig. 16b). SNAP and Augustus models were generated where possible using the MAKER pipeline and species-specific evidence. A consensus gene set from the above prediction algorithms was generated, using a logical, hierarchical approach developed at MGI.

High confidence gene selection

A high confidence gene set was created from MAKER122 output. First, Quality Index (QI) criteria were calculated as follows: (i) length of the 5' UTR; (ii) fraction of splice sites confirmed by an EST; (iii) fraction of exons that overlapped an EST alignment; and (iv) fraction of exons that overlapped EST or protein alignments. Second, these decision-making steps were followed:

a) Genes were screened for overlaps (<10% overlap was allowed).
b) If QI[2] and QI[3] were >0, or QI[4] was >0, then the gene was kept.
c) Genes were retained if they matched Swissprot123 using BLAST (E<1e-06).
d) Genes were retained if they matched Pfam124 using RPSBLAST (E<1e-03).
e) RPSBLAST was run against CDD125 (E<1e-03 and coverage >40%). Genes that met both cut-offs were kept.
f) If no hit was recorded, the gene was retained if it had ≥55% identity to the genes database from KEGG126, and a bitscore of ≥35.
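Steps b) to f) can be read as a cascade in which a gene is retained as soon as one criterion is met. The sketch below follows that reading, which is an assumption, and omits the overlap screening of step a); the dictionary keys and function name are hypothetical and do not correspond to MAKER output fields.

```python
def is_high_confidence(gene):
    """Apply steps (b) to (f) to one gene that already passed step (a).

    Hypothetical keys:
      qi               dict mapping 1..4 to the QI values (i)-(iv)
      swissprot_evalue best BLAST e-value against Swissprot, or None
      pfam_evalue      best RPSBLAST e-value against Pfam, or None
      cdd              (e-value, coverage fraction) against CDD, or None
      kegg             (percent identity, bitscore) against KEGG, or None
    """
    qi = gene["qi"]
    if (qi[2] > 0 and qi[3] > 0) or qi[4] > 0:           # step b
        return True
    sp = gene.get("swissprot_evalue")
    if sp is not None and sp < 1e-6:                      # step c
        return True
    pf = gene.get("pfam_evalue")
    if pf is not None and pf < 1e-3:                      # step d
        return True
    cdd = gene.get("cdd")
    if cdd is not None and cdd[0] < 1e-3 and cdd[1] > 0.40:  # step e
        return True
    kegg = gene.get("kegg")                               # step f
    return kegg is not None and kegg[0] >= 55 and kegg[1] >= 35
```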

Additional curation of gene sets

Depending on the nature of the final gene set in relation to the assembly quality, some gene sets underwent an additional manual review of short genes lacking definitive evidence. After the high confidence gene selection steps described above, shorter single and double exon genes and genes annotated as hypothetical (with neither KEGG nor InterPro homologies) were further scrutinized. The Annotation Edit Distance (AED, from MAKER) was reviewed manually in combination with the QI scores (all provided by MAKER), enabling analysts to make a more informed decision about whether to keep or discard each such gene.

2.3 Data production, Blaxter Nematode and Neglected Genomics (BaNG)

The genomes of three species were sequenced by BaNG (Supplementary Tables 1 and 22).

Genome sequencing library preparation and sequencing

All genome sequencing was carried out on Illumina HiSeq 2000 and HiSeq 2500 instruments, using 100 or 125 base, paired end protocols. Illumina paired end libraries were generated using the Illumina TruSeq protocol, following manufacturer's instructions. For Litomosoides sigmodontis, three libraries were sequenced with insert sizes of ~300 bp, 600 bp and 600 bp. For A. viteae, three sequence datasets were generated from libraries with 350 bp insert sizes.

Genome assembly

Raw data were filtered of contaminating host reads using blobtools127. Cleaned reads were digitally normalised with the khmer software using a k-mer size of 41 and then assembled with ABySS (v 1.3.3)128, requiring a minimum of 3 pairs to connect contigs during the scaffolding phase (n=3).

Assembly QC

Assemblies were assessed using blobtools and the CEGMA105 pipeline, detailed in Supplementary Fig. 15c.

Gene prediction

Augustus95 was used to predict gene models, trained using a GFF-format annotation file generated by MAKER94. L. sigmodontis 454 RNA-seq data were assembled with MIRA and Newbler111. Trinity129 was used to assemble O. ochengi Illumina RNA-Seq data130, which were subsequently used as hints in MAKER (Supplementary Fig. 16c).

3 Functional annotation

Assigning protein names to predicted proteins

To assign protein names to each predicted protein (e.g. from Brugia timori), UniProt protein naming rules (http://www.uniprot.org/docs/nameprot) were followed where possible. To identify an ortholog with a manually curated protein name in UniProt (taking human, zebrafish, Drosophila melanogaster, Caenorhabditis elegans, and Schistosoma mansoni orthologs) or GeneDB108 (S. mansoni orthologs), one-to-one or many-to-one (e.g. many-B. timori-to-one-C. elegans) orthologs were first identified based on phylogenetic trees in our in-house Compara15 database (see Compara database of gene families below). The correspondence between UniProt accessions and Ensembl accessions (used in our Compara database) was downloaded from the UniProt website. In order of preference, the ortholog was selected from: C. elegans, S. mansoni, human, D. melanogaster and then zebrafish. If an ortholog with a manually curated protein name from the most preferred species (C. elegans) was not found, the second-most preferred species (S. mansoni) was checked, and so on. From UniProt, the ‘recommended name’ (RN) of the ortholog was used, while from GeneDB the ‘product description’ was used. If no ortholog with a manually curated protein name was identified, then an ortholog with a non-curated protein name (i.e. from a TrEMBL entry123) was sought. The selected protein names (e.g. ‘caveolin’) were transferred to predicted proteins (e.g. in B. timori) and the UniProt/GeneDB accession of the source protein was recorded, along with the evidence code ECO:0000265 (‘sequence orthology evidence used in automatic assertion’), from the Evidence Code Ontology (ECO; http://www.evidenceontology.org). If several genes in the species of interest (e.g. B. timori) were assigned the same protein name (for example, because of many-to-one orthology to the same C. elegans gene), they were numbered (e.g. ‘caveolin-1’, ‘caveolin-2’, etc.) to ensure they were given unique names.

If a particular query protein (e.g. from B. timori) was not assigned any protein name based on its orthologs, then a protein name was assigned based on InterPro131 domains in the protein (e.g. ‘ repeat and SAM-domain-containing protein’). The InterPro accession(s) of the source domains were noted, and the evidence code for the protein name was recorded as ECO:0000259 ('match to InterPro signature evidence used in automatic assertion'). If a query protein was not assigned a protein name based on either orthologs or InterPro domains, it was named ‘Hypothetical protein’. The protein names were added to the protein fasta file headers for each species.

Taking all predicted proteins for the 45 new genomes, 37.3% were assigned names based on UniProt entries (of which 47.6% were from C. elegans orthologs and 12.5% from S. mansoni), 22.1% based on InterPro domains, and the remainder were labelled ‘Hypothetical protein’. Of the unique names transferred from UniProt (some of which were given to several predicted proteins from different species), 49.6% were transferred from curated (‘reviewed’) UniProt entries and the other 50.4% from uncurated (‘unreviewed’) entries.
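The species preference order and the numbering of duplicated names can be sketched as follows. The function names and input layout are hypothetical, the fallback to uncurated (TrEMBL) names is omitted for brevity, and the numbering helper is meant for ortholog-derived names only, since the text does not say whether InterPro-derived or ‘Hypothetical protein’ names were also numbered.

```python
from collections import Counter

# Order of preference for the source of a curated ortholog name.
PREFERENCE = ["C. elegans", "S. mansoni", "human", "D. melanogaster", "zebrafish"]


def pick_name(curated_names, interpro_name=None):
    """Choose a name for one protein.

    curated_names maps species to the curated name of its ortholog.
    Falls back to an InterPro-derived name, then 'Hypothetical protein'.
    """
    for species in PREFERENCE:
        if species in curated_names:
            return curated_names[species]
    return interpro_name or "Hypothetical protein"


def number_duplicates(names):
    """Append -1, -2, ... to names that occur more than once, in input order."""
    totals = Counter(names)
    running = Counter()
    numbered = []
    for name in names:
        if totals[name] > 1:
            running[name] += 1
            numbered.append(f"{name}-{running[name]}")
        else:
            numbered.append(name)
    return numbered
```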

Assigning GO terms to predicted proteins

To assign Gene Ontology (GO) terms to the predicted proteins from each species, GO terms were transferred from their human, zebrafish, C. elegans, and Drosophila melanogaster orthologs, as described below. Manually curated GO annotations for human, zebrafish, C. elegans, and D. melanogaster were downloaded from the Gene Ontology website132 and filtered to exclude annotations not based on experimental evidence (i.e. only those with evidence codes IDA/IEP/IGI/IMP/IPI were retained), annotations with a ‘NOT’ qualifier, and annotations to the GO:0005515 (‘protein binding’) term, following the criteria used by the Compara project for projecting GO terms to orthologs15.

For each predicted protein in a particular species (e.g. Brugia timori), all orthologs (including one-to-one, one-to-many, and many-to-many orthologs) of the gene in human, zebrafish, C. elegans, and D. melanogaster were identified based on phylogenetic trees in our in-house Compara database (see Compara database of gene families below). To assign GO terms to a particular query gene (e.g. from B. timori), we identified the human, zebrafish, C. elegans and D. melanogaster orthologs that have manually curated GO terms. Taking each pair of orthologs (A, B) from two different species (e.g. a C. elegans ortholog and a D. rerio ortholog, but not two C. elegans orthologs), we used a breadth-first search algorithm to find the last common ancestors of their GO terms in the GO hierarchy. For example, if A has GO terms [A1, A2, A3] and B has GO terms [B1, B2], we found the last common ancestors of A1+B1, A1+B2, A2+B1, A2+B2, A3+B1, and A3+B2. The GO terms assigned to the (e.g. B. timori) query gene were the union of the last common ancestors of GO terms for all pairs of orthologs from two different species. We removed any GO term from this set that is an ancestor (in the GO hierarchy) of another term in the set. GO terms of the three possible types (molecular function, cellular component and biological process) were assigned to the query protein (e.g. from B. timori) in this way. The UniProt accession of the source (ortholog) protein was noted, and the evidence code for the GO terms was recorded as IEA (‘inferred from electronic annotation’).

To maximise the amount of GO annotation, terms were transferred from all orthologs, not just one-to-one orthologs (this differs from the annotation of vertebrate orthologs by Ensembl Compara15). The Compara pipeline is designed to transfer GO terms between relatively closely related vertebrate species. A new pipeline was therefore developed to transfer GO terms across animal phyla (e.g. from D. melanogaster or human to a helminth). Instead of transferring GO terms directly between orthologs, the last common ancestor terms of orthologs from two different species (e.g. a C. elegans ortholog and a D. rerio ortholog) were transferred. These GO terms are more likely to be conserved across the more distantly related species in this data set, and thus more likely to be accurate predictions for the query protein (e.g. from B. timori).

For each query protein, GO terms were also identified using InterproScan133, which identifies InterPro131 domains in the protein and maps GO terms to the domains. The InterPro accession(s) of the source domains were noted, and the evidence code for the GO terms was recorded as IEA.
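The last-common-ancestor transfer can be illustrated on a toy hierarchy. In this sketch the GO graph is a dictionary from each term to its parent terms, a term counts as its own ancestor (so two orthologs sharing a term yield that term), and ontology root terms are not treated specially; these choices, the function names and the toy term names are assumptions rather than details of the original pipeline.

```python
from collections import deque
from itertools import combinations, product


def ancestors(term, parents):
    """Return the term plus all its ancestors, found by breadth-first search."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def drop_ancestors(terms, parents):
    """Remove every term that is an ancestor of another term in the set."""
    return {t for t in terms
            if not any(t != u and t in ancestors(u, parents) for u in terms)}


def last_common_ancestors(term_a, term_b, parents):
    """Most specific terms shared by the ancestries of term_a and term_b."""
    common = ancestors(term_a, parents) & ancestors(term_b, parents)
    return drop_ancestors(common, parents)


def transfer_terms(orthologs, parents):
    """orthologs: list of (species, set of GO terms), one entry per ortholog."""
    result = set()
    for (sp_a, terms_a), (sp_b, terms_b) in combinations(orthologs, 2):
        if sp_a == sp_b:
            continue  # only pairs of orthologs from two different species
        for a, b in product(terms_a, terms_b):
            result |= last_common_ancestors(a, b, parents)
    return drop_ancestors(result, parents)


# Toy hierarchy (child -> parents); the term names are illustrative only.
toy_parents = {
    "dna_binding": ["nucleic_acid_binding"],
    "rna_binding": ["nucleic_acid_binding"],
    "nucleic_acid_binding": ["binding"],
    "binding": ["root"],
    "kinase": ["catalytic"],
    "catalytic": ["root"],
}
toy_orthologs = [
    ("C. elegans", {"dna_binding"}),
    ("D. rerio", {"rna_binding", "kinase"}),
    ("C. elegans", {"kinase"}),
]
toy_terms = transfer_terms(toy_orthologs, toy_parents)
```

In the toy example the C. elegans ‘dna_binding’ and D. rerio ‘rna_binding’ annotations yield ‘nucleic_acid_binding’, the shared ‘kinase’ annotation is kept as is, and ‘root’ is dropped because it is an ancestor of the other two terms.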

4 Repeat libraries and repeat-masking

RepeatModeler repeat libraries

For the 45 species sequenced, we built repeat libraries using RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html). For species sequenced at WTSI, these were filtered to remove repeats that had BLASTN hits of E≤0.001 against known protein-coding genes and ncRNA genes from nematodes and platyhelminths (DNA sequences for C. elegans protein-coding and ncRNA transcripts from WormBase WS235134, and for S. mansoni and E. multilocularis protein-coding transcripts from GeneDB108). For previously sequenced helminth species, repeat libraries were obtained from WormBase ParaSite135 or WormBase where they were generated using RepeatModeler for Pristionchus pacificus, Panagrellus redivivus, Trichinella spiralis, Trichuris muris, , and generated de novo for S. mansoni. If it had not already been done, we ran RepeatClassifier (part of the RepeatModeler software) to classify the repeats in each library.

TransposonPSI repeat libraries

In addition to RepeatModeler, repeats were identified using TransposonPSI (http://transposonpsi.sourceforge.net). TransposonPSI uses PSI-BLAST to search for sequence matches at the DNA or protein level to proteins encoded by transposable elements. Sequences < 50 bp were removed from the TransposonPSI repeat library and RepeatClassifier was used to classify these repeats.

LTRharvest/LTRdigest repeat libraries

Long terminal repeat (LTR) retrotransposons were also identified using LTRharvest136. To remove likely false positives, and make a library of full-length (or mostly full-length) LTR retrotransposon sequences, LTRdigest was run on the LTRharvest results, with selected protein HMMs from Pfam137 (the specific Pfam domains are listed in tables B1 and B2 of the LTRdigest publication138) and GyDB139. Any LTR retrotransposon candidates without domain hits were removed (to further remove likely false positives). Repeats in the resulting repeat library were classified with RepeatClassifier.

Repeat libraries from merging the RepeatModeler, TransposonPSI and LTRharvest libraries

The repeat libraries from RepeatModeler, TransposonPSI and LTRharvest were merged by using USEARCH v7 140 to cluster the candidate sequences with ≥80% identity, and remove all but one sequence for each cluster. The resultant repeat library should be non-redundant. To try to remove protein-coding genes from non-transposable element genes (e.g. repeats), the non-redundant library was then filtered to remove repeats that were classified as ‘unknown’ (by RepeatClassifier) and that had BLASTN hits of E≤0.001 against known protein-coding genes and ncRNA genes from nematodes and platyhelminths (DNA sequences for C. elegans protein-coding and ncRNA transcripts from WormBase WS235 134, and for S. mansoni and E. multilocularis protein-coding transcripts from GeneDB108). We built a merged repeat library for each of the 81 helminth species, not just for the 45 species that we sequenced.

RepeatMasking

The non-redundant library for a species (made by merging the libraries from RepeatModeler, TransposonPSI and LTRharvest) was used to mask repeats in that species’ genome using RepeatMasker (http://www.repeatmasker.org/), with the -s (sensitive) option. This will have masked low complexity DNA and simple repeats as well as transposable elements.

5 Regression model for genome size

All modelling used R version 3.2.2 (http://www.R-project.org). The initial standard regression model and stepwise model fitting were performed using the lm and step functions from base R. The generalized mixed-effect model was fitted in MCMCglmm141 (v2.24). To create a Bayesian mixed-effect model of genome size, the species tree was first transformed into an ultrametric tree using PATHd8142, with a small constant added to the shortest branches in the tree to ensure no zero-length branches were reconstructed, and then the outgroup species were removed from this tree. As an initial sanity check, a model almost equivalent to the basic regression model (see Results) was first constructed in this Bayesian setting (there were some detailed differences in the models, due to the prior distribution and because MCMCglmm includes an additive residual variance component for each observation). As both coefficients and significance levels from this analysis were similar to the multiple regression model, we then included the phylogenetic random effect.

6 Mitochondrial genome analysis

Mitochondrial genomes (mitogenomes) were reconstructed from Illumina reads with MITObim version 1.6 143. Mitochondrial fragments in the nuclear genome assembly were identified by BLASTX using C. elegans mitochondrial genes as queries. Those fragments were extended by iterative mappings of Illumina short reads using MITObim. Assembled mitogenomes were annotated for protein-coding, tRNA and rRNA genes using the MITOS web server144. Assemblies and annotations were manually curated using the Artemis genome annotation tool145 based on evidence from sequence similarity to other published mitogenomes. Complete or ‘nearly complete’ mitochondrial genomes for 50 of the 81 helminth species were assembled de novo as described above, including for nine species whose nuclear genomes have been previously published (Supplementary Table 24). We were not able to obtain Illumina reads for Meloidogyne hapla in order to make an assembly, while for Globodera pallida we did not attempt assembly because it is known to have a multi-partite mitochondrial genome146. ‘Nearly complete’ assemblies (6/50 species) have a gap in the genome, mostly at the AT-rich non-coding regions. All these assemblies have been deposited in the DNA Data Bank of Japan (DDBJ)101. No reasonable mitochondrial assemblies could be obtained for four of our 45 newly sequenced species: Schistosoma rodhaini, Elaeophora elaphi, Soboliphyme baturini and Protopolystoma xenopodis. Apart from E. elaphi, these species have relatively low nuclear genome assembly contiguities (as measured by N50/scaffold-count: 0.4, 3.2, 0.9, 0.01; Supplementary Table 1). Including 25 mitogenomes from previous studies, a total of 75 mitochondrial genomes (12 cestode, 10 trematode, one other platyhelminth (Schmidtea), 52 nematode (counting Rhabditophanes as one)) were used in further analyses.

Phylogenetic analysis

Amino acid alignments for 12 protein-coding genes were separately generated using MAFFT147 with the L-INS-i option, and trimmed by Gblocks148 with less stringent options (-b3=8 -b4=4 -b5=n -b6=y) to remove poorly aligned sites. The best substitution model for each alignment was estimated by ProtTest149. All the protein alignments were concatenated, and maximum likelihood trees were constructed (one for cestodes, one for nematodes, and one for trematodes) using RAxML v8.2.7 150. Bootstrap values were calculated with 1000 pseudosamples.

7 Defining high-quality ‘tier 1’ species for downstream analyses

In order to confidently identify clade-specific genome innovations (e.g. gene family expansions), a subset of nematode and platyhelminth genomes (33/81), termed ‘tier 1’, were selected that had better quality assemblies and spanned the major clades (Supplementary Table 3). Subsequent exploratory analyses were carried out on these 33 tier 1 species. Evolutionary trends identified using tier 1 were confirmed by analysing the full set of 81 species. To choose the tier 1 species, the 81 species were divided into 14 different species groups, the majority of which correspond to major phylogenetic clades (listed as ‘Analysis Group’ in Supplementary Table 3). Species with highly fragmented assemblies (N50/scaffold-count < 0.3) were discarded; and species were selected that (i) had the most contiguous assemblies (mostly with N50/scaffold-count > 5), and relatively complete proteomes (usually CEGMA partial score > 85%), or (ii) that helped to ensure that ~50% of the genera in each phylogenetic group were represented (full reasons for species choice are given in Supplementary Table 3). Tier 1 comprised species from the major platyhelminth and nematode clades that included parasites, plus four free-living nematodes, but excluded all outgroup species.
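As a rough illustration of the contiguity screen, the following Python sketch computes an N50/scaffold-count ratio and applies the thresholds quoted above. The units of N50 are not stated in this section, so the sketch simply divides the N50 by the number of scaffolds; the function names and the labels ‘discard’, ‘candidate’ and ‘review’ are hypothetical, and the actual selection also relied on the clade-representation criterion and manual judgement (Supplementary Table 3).

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


def contiguity_ratio(lengths):
    """N50 divided by the number of scaffolds (units are an assumption here)."""
    return n50(lengths) / len(lengths)


def tier1_screen(lengths, cegma_partial_pct):
    """Apply the quoted thresholds; borderline cases are left for manual review."""
    ratio = contiguity_ratio(lengths)
    if ratio < 0.3:
        return "discard"
    if ratio > 5 and cegma_partial_pct > 85:
        return "candidate"
    return "review"
```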

8 Compara database of gene families

Construction of the in-house Compara database

To establish orthology relationships among the 81 platyhelminth and nematode species, an in-house Ensembl Compara15 database was constructed containing these species, and ten additional outgroup species: Amphimedon queenslandica (a sponge), Capitella teleta (an annelid), Ciona intestinalis (sea squirt), Crassostrea gigas (a mollusc), Danio rerio, D. melanogaster, Homo sapiens, Ixodes scapularis (a tick), Nematostella vectensis (a cnidarian), and Trichoplax adhaerens (a placozoan) (Supplementary Table 4). The published C. elegans reference genome was used in the database, rather than the draft de novo assembly produced as part of this project. Ensembl Compara is a pipeline to cluster proteins from an input set of species into gene families, build a phylogenetic tree for each family, and infer orthologs and paralogs from the trees. In summary, the pipeline did the following. For each species, the longest protein translation for each gene was searched against all other protein sequences in the database using NCBI-BLAST151. Graphs were constructed with edges between the proteins (nodes) retained if they were the best reciprocal hits (BRH), or had BLAST score ratios (BSR) >0.33, where the BSR is defined as the score between two proteins, P1 and P2, divided by the maximum self score for P1 or P2. From the graph, the connected components (i.e. single linkage clusters) were extracted. Each connected component represents a cluster, that is, a gene family. If a cluster had more than 750 members, the graph construction and clustering steps were repeated at higher stringency.
Multiple alignments of proteins in the same cluster were made using MUSCLE152, and the phylogenetic tree program TreeBeST (http://treesoft.sourceforge.net/treebest.shtml) was used to infer paralogs and orthologs from back-translated protein-based multiple alignments and an input species tree (described below) that is necessary for gene and species tree reconciliation. The resulting trees were flattened into ortholog and paralog tables of pairwise relationships between genes. In the case of paralogs, this flattening also records the timing of the duplication due to the presence of extant species that contain the gene duplication. The species tree used to construct an initial version of our Compara database included 77 species and used an edited version of the NCBI taxonomy153 that had several controversial speciation nodes represented as multifurcations. For subsequent versions of the database, the input species tree was built based on orthologs inferred from the previous version: sets of one-to-one orthologs present in at least 20 species were extracted and aligned using MAFFT v6.857 147, then poorly aligned regions were trimmed with GBlocks v0.91b with default parameters. These alignments of the different gene families were concatenated, and used to build a maximum likelihood tree using a partitioned analysis in RAxML v7.8.6 150 under the best-fitting (minimum Akaike information criterion) amino acid substitution model for each ortholog group.
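The graph construction and single-linkage clustering step described above can be sketched in Python as follows. This is a simplified illustration rather than the Ensembl Compara implementation: the function name, the input dictionaries and the way best reciprocal hits are determined (over all proteins, ignoring species) are assumptions.

```python
from collections import defaultdict


def cluster_by_bsr(scores, self_scores, threshold=0.33):
    """Single-linkage clusters from BLAST scores.

    scores:      {(query, target): score} for query != target
    self_scores: {protein: score of the protein against itself}
    """
    parent = {p: p for p in self_scores}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Best-scoring target for every query (simplification: species are ignored).
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = target

    for (query, target), score in scores.items():
        is_brh = best.get(query) == target and best.get(target) == query
        bsr = score / max(self_scores[query], self_scores[target])
        if is_brh or bsr > threshold:
            union(query, target)

    clusters = defaultdict(set)
    for protein in parent:
        clusters[find(protein)].add(protein)
    return sorted(sorted(members) for members in clusters.values())


# Toy example with hypothetical proteins A-D, each with a self score of 100.
self_scores = {"A": 100, "B": 100, "C": 100, "D": 100}
scores = {
    ("A", "B"): 50, ("B", "A"): 50,   # BSR 0.5: edge kept
    ("C", "D"): 20, ("D", "C"): 20,   # BSR 0.2, but best reciprocal hits
    ("A", "C"): 10, ("C", "A"): 10,   # BSR 0.1 and not reciprocal best
}
families = cluster_by_bsr(scores, self_scores)
print(families)
```

A union-find structure is used here because connected components are all that is needed; the re-clustering of families with more than 750 members at higher stringency is not shown.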

Identification of gene families, orthologs and paralogs

The in-house Compara database was queried using the Ensembl Perl API, to identify gene families, orthologs, and within-species paralogs. Clade- and species-specific gene families were identified by finding the level of the root node of each gene tree. As mentioned above, the Ensembl Compara pipeline splits up large families of >750 genes; for such cases, we merged these subfamilies back into one large family, yielding 109,571 gene families.

9 Identification of synapomorphic gene families

The Compara protein families (Supplementary Information: Results 2.1) were analysed using KinFin v0.8.3 154 by providing InterPro IDs from functional annotations (InterProScan output; Supplementary Information: Methods 21) and the phylogenetic relationships of the included taxa. All analysis was performed on the tree topology shown in Fig. 2, which has clades III, IV and V as a polytomy. Synapomorphic clusters were determined using Dollo parsimony at 28 nodes of interest across the phylogenetic tree (Supplementary Table 8), requiring that a Compara family must contain genes from at least one descendant species from each child node of the node in question, and must not contain any other species. The synapomorphic families were filtered based on percentage of species present (requiring taxon coverage: ≥ 90% of descendant species of node) and functional annotation of families. Functional annotation of a gene family was inferred if more than 90% of the species present contained at least one gene with a particular domain. The counts for synapomorphies by node of interest and their inferred functional annotations are given in Supplementary Table 8.
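The criteria for calling a family synapomorphic at a node can be expressed as a short Python sketch. It is not KinFin code: the function name and the representation of each child node as a set of descendant species are assumptions made for illustration.

```python
def is_synapomorphic(family_species, child_clades, min_coverage=0.9):
    """Check the presence/absence criteria for one family at one node.

    family_species: species that have at least one gene in the family
    child_clades:   one set of descendant species per child of the node
    """
    family = set(family_species)
    descendants = set().union(*child_clades)
    # The family must not contain species from outside the node.
    if not family or not family <= descendants:
        return False
    # At least one descendant species from each child node must be present.
    if any(not (family & set(clade)) for clade in child_clades):
        return False
    # Taxon coverage filter: >= 90% of the node's descendant species.
    return len(family) / len(descendants) >= min_coverage


# Hypothetical node with two children of five species each.
child_a = {"a1", "a2", "a3", "a4", "a5"}
child_b = {"b1", "b2", "b3", "b4", "b5"}
```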

10 Phylogenetic analysis of candidate lateral gene transfers

Ferrochelatase Compara families were extracted by screening for genes annotated with a Ferrochelatase (IPR001015) domain. Functional ferrochelatases (also containing the domain Ferrochelatase active site IPR019772) were found in family 787620, composed mainly of nematode sequences, and family 850580, harbouring sequences from non-nematode taxa. Ferrochelatase-like sequences (devoid of active site) were recovered from family 740872 and family 1184543. Additional ferrochelatase protein sequences were retrieved for 17 bacterial taxa (YP_002977390.1, NP_386909.2, ZP_07659792.1, YP_198549.1, ZP_03788224.1, NP_966898.1, YP_001975511.1, YP_507215.1, YP_303255.1, YP_196566.1, ABV58328.1, YP_001266120.1, YP_350458.1, YP_234061.1, YP_003998063.1, EFQ76108.1, and ADO44739.1) from NCBI. Sequences were aligned using MAFFT v7.267 (E-INS-i algorithm)147 and the alignment was trimmed using trimAl v1.4 155. Phylogenetic analysis was carried out using RAxML150 under the PROTGAMMAGTR model of sequence evolution, with 20 alternative runs on distinct starting trees. Non-parametric bootstrap analysis was carried out for 100 replicates.

A similar process was followed for the cobyric acid synthase and acetate/succinate transporter proteins: the highest scoring BLAST hits from GenBank, and representative sequences from other taxonomic groups, were aligned with MAFFT v7.205 147 with the --auto flag to allow the software to select the most appropriate alignment algorithm, then trimmed with trimAl v1.4 155. Phylogenetic analysis was performed using RAxML v8.2.8 150 under the model that minimized the Akaike Information Criterion from the empirical amino acid substitution models available in the software (LG4X for cobyric acid synthase, LG4M for acetate transporter), based on 5 random addition-sequence replicates, and 100 non-parametric bootstrap replicates.

11 Network representation of gene families

The network representation of the Compara protein clustering (Supplementary Fig. 17) was generated using the generate_network.py script distributed with KinFin v0.8.3 (http://doi.org/10.1101/159145). Proteomes are represented as nodes in the graph and edges between two nodes are weighted by the number of times proteins of both proteomes appear together in a cluster. The nodes in the graph were positioned using the force-directed ForceAtlas2 layout algorithm implemented in Gephi v0.9.1 156 (parameters: “Approximation”=1.2, “Approximate Repulsion”=True, “Scaling”=10000, “Stronger Gravity”=True, “Gravity”=1.2, “LinLog mode”=True, “Dissuade hubs”=False, “Prevent overlap”=False, “Edge Weight Influence”=1.0). Under this layout algorithm nodes repulse each other like charged particles, while edges attract their nodes like springs. Nodes were coloured by taxonomic group and scaled proportional to the size of the proteome, and edges were coloured by the connection they establish (within/between nematodes/platyhelminths/outgroups). The PageRank for each node (proteome) in the network, which is a measure of node centrality, was calculated within Gephi (“Epsilon”=1.0e-6, “Probability”=0.85, “Use Edge Weight”=True).

12 Phylogenetic tree based on gene family presence/absence

To quantify the degree to which gene family content reflects the phylogeny, we constructed a maximum-likelihood phylogeny using a matrix of gene family presence and absence for families that are not shared by all 81 species, using RAxML v8.2.8 150, with a two-state model and the Lewis method to correct for the absence of constant-state observations in the data matrix.

13 Identification of gene family expansions

Filtering gene sets and Compara families for transposable element genes

InterProScan 5 133 was used to identify predicted proteins with Pfam domains associated with transposable elements, using a list from Foth et al53: PF12762, integrase; PF03221, DNA-binding; PF03184, endonuclease; PF00078, reverse transcriptase; PF03564, DUF1759; PF05380, Pao retrotransposon peptidase; PF10551, MULE transposase; PF00077, retroviral aspartyl protease; PF13456, reverse transcriptase-like; PF00665, integrase; PF14227, Gag-polypeptide of LTR copia-type; PF03732, retrotransposon gag protein; PF01541, GIY-YIG catalytic domain; PF00680, RNA dependent RNA polymerase; PF07727, reverse transcriptase; PF13961, DUF4219; PF01359, transposase; PF08284, retroviral aspartyl protease; PF13976, GAG-pre-integrase; PF14223, Gag-polypeptide of LTR copia-type; and PF14244, Gag-polypeptide of LTR copia-type. A total of 2153 gene families were identified that had at least one transposon Pfam domain assigned to at least one member. We considered a family as ‘transposon-related’ if ≥20% of its genes (with or without any Pfam annotation) had a transposon-associated Pfam domain, and if ≥80% of the genes with at least one Pfam annotation had a transposon-associated Pfam domain. This identified 1220 gene families, which were subsequently excluded from further analyses of gene family expansions.
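The two-threshold rule for flagging a family as ‘transposon-related’ can be written as a short Python sketch. The function name and the input format (one set of Pfam accessions per gene) are assumptions, and only three of the accessions listed above are included to keep the example short.

```python
# Subset of the transposon-associated Pfam accessions listed above (illustration only).
TRANSPOSON_PFAM = {"PF00665", "PF07727", "PF01359"}


def is_transposon_related(family, te_domains=TRANSPOSON_PFAM):
    """family: list with one set of Pfam accessions per gene (empty set if none)."""
    n_genes = len(family)
    annotated = [domains for domains in family if domains]
    n_te = sum(1 for domains in annotated if domains & te_domains)
    if n_genes == 0 or not annotated:
        return False
    # >= 20% of all genes, and >= 80% of Pfam-annotated genes, carry a TE domain.
    return n_te / n_genes >= 0.2 and n_te / len(annotated) >= 0.8
```

For example, a family of ten genes in which only three have any Pfam annotation, all three with a transposon-associated domain, passes both thresholds (30% of all genes, 100% of annotated genes).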

Metrics for identifying highly variable gene families

To identify gene families that vary greatly in gene count across the species tree, we developed three metrics to capture aspects of this variability. To control for differences in gene counts due to partial assemblies (e.g. genes split across multiple contigs), we used summed protein length per species for each family as a proxy for gene counts in these metrics. The first metric, coefficient of variation (Cv), simply measured the variability in gene count in a family (see below). The other two metrics, Z-score (Zmax) and enrichment coefficient (Emax), reflected whether there is a gene family expansion in a particular group of species, relative to the rest of the species tree. To do this, species were classified into a set of non-overlapping species groups (‘Analysis group’ in Supplementary Table 3), such as ‘Trematodes - Schistosomatids’, ‘Trematodes - Other’, ‘Cestodes’, and so on. These species groups are mostly monophyletic, with a couple of exceptions (e.g. ‘Platyhelminthes - other’, ‘V - other’) and are hereafter termed ‘narrower species groups’. To identify gene families that have expanded across a phylogenetically broader range of species, we also defined a set of non-overlapping ‘broader species groups’ (Supplementary Table 3), such as ‘Trematodes’, ‘Cestodes’, etc. The three metrics were defined as follows:

(i) coefficient of variation,

$$C_v = \frac{s}{\bar{x}}$$

where $s$ is the standard deviation in the summed protein length per species in the family, and $\bar{x}$ is the mean of the summed protein length per species in the family.

(ii) maximum Z-score,

$$Z_{\max} = \max_{c \in T} \left( \frac{\bar{x}_{i,\, i \in c} - \bar{x}}{s_{i,\, i \notin c}} \right)$$

where $T$ is the set of non-overlapping species groups, $c$ is a species group in the set $T$, $\bar{x}_{i,\, i \in c}$ is the mean of the summed protein length (per species) in one of the species groups $c$ in set $T$, and $s_{i,\, i \notin c}$ is the standard deviation in summed protein length per species in the species outside group $c$. That is, for each species group in the set $T$ (e.g. in the set of ‘narrower species groups’), we calculated Z as the difference between the mean summed protein length (per species) in that species group and the overall mean $\bar{x}$, divided by the standard deviation $s_{i,\, i \notin c}$; Zmax is the maximum of these Z values over all tested species groups. Note that the standard deviation used here was for the species outside species group $c$, so that it was not affected by any gene family expansion that occurred in the family within the species group $c$.

(iii) maximum enrichment coefficient,

$$E_{\max} = \max_{c \in T} \left( \frac{\bar{x}_{i,\, i \in c}}{\bar{x}_{i,\, i \notin c}} \right)$$

That is, for each species group in the set $T$, we calculated E as the ratio between the mean of the summed protein length (per species) in that species group and the mean summed protein length outside that species group; Emax is the maximum of these E values over all tested species groups.

When calculating the means and standard deviations that contribute to these metrics, the 33 ‘tier 1’ species (those with high-quality assemblies; Supplementary Information: Methods 7) that lack any genes in a family were taken into account, as follows. First, to calculate the mean and standard deviation for a whole family (e.g. for Cv), we identified the node of the species tree corresponding to the root of the gene tree for the family (e.g. the ancestor of all strongylids). Then, for each tier 1 descendant species (e.g. all strongylids) that is present on the species tree but not in the current gene family, we counted its summed protein length as zero when calculating the mean and standard deviation. Similarly, when calculating the means and standard deviations for a particular clade (for Zmax or Emax), the node of the species tree corresponding to the root of the clade was identified, and all tier 1 species that descended from that node of the species tree were taken into account.

Note that the three metrics may pick up families with different patterns of variability. Z-score and enrichment coefficient may detect a gene family expansion in a particular clade of the species tree, but the coefficient of variation may pick up cases that they miss, such as a gene family that has independently expanded in different clades of the species tree (e.g. and schistosomatids).
In addition, the Z-score and enrichment coefficient can only detect a gene family expansion in a particular clade of the species tree if the gene family also includes genes from species outside this clade, whereas the coefficient of variation can detect large variability within a single clade (e.g. schistosomatids) even if the family lacks members from outside that clade. These metrics were calculated for each of the 108,351 families (after excluding 1220 transposon families) in our in-house Ensembl Compara database. To further increase the reliability of these measures, they were calculated by only considering the tier 1 species, since these species have the highest quality assemblies, and therefore most complete proteomes and fewest artefactual gene splits and merges. Because at least two genes are needed to calculate the metrics across the 108,351 families, we only calculated these metrics for the 40,599 families that include at least two genes from the 33 tier 1 species. Furthermore, in very small families, the estimates of Cv, Zmax and Emax are very noisy (because the underlying estimates of mean and standard deviation are noisy), so we discarded families having <10 genes from tier 1 species, leaving 10,986 families. Each metric was calculated for both the ‘narrower species groups’ (e.g. ‘Trematodes - Schistosomatids’, ‘Trematodes - Other’, ‘Cestodes’, etc.) and the ‘broader species groups’ (e.g. ‘Trematodes’, ‘Cestodes’, etc.). Because the coefficient of variation is independent of the species groups (it is calculated based on all genes from all species in the gene family), for each family we calculated five variables reflecting variability: Cv, Zmax and Emax for the narrower species groups, and Zmax and Emax for the broader species groups.
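To make the three metrics concrete, the following Python sketch computes Cv, Zmax and Emax for a single family from a table of summed protein lengths. It is an illustration under stated assumptions, not the code used in this study: the function name and input format are hypothetical, the sample standard deviation is assumed (the text does not say whether the sample or population form was used), and species lacking the family are expected to be supplied with a length of zero.

```python
from statistics import mean, stdev


def variability_metrics(lengths, groups):
    """Return (Cv, Zmax, Emax) for one gene family.

    lengths: species -> summed protein length in the family (0 if absent)
    groups:  species -> name of its non-overlapping species group
    """
    values = list(lengths.values())
    overall_mean = mean(values)
    cv = stdev(values) / overall_mean

    z_values, e_values = [], []
    for group in set(groups.values()):
        inside = [v for sp, v in lengths.items() if groups[sp] == group]
        outside = [v for sp, v in lengths.items() if groups[sp] != group]
        if len(outside) < 2:
            continue  # a standard deviation needs at least two species outside
        sd_outside, mean_outside = stdev(outside), mean(outside)
        if sd_outside > 0:
            z_values.append((mean(inside) - overall_mean) / sd_outside)
        if mean_outside > 0:
            e_values.append(mean(inside) / mean_outside)
    return cv, max(z_values, default=None), max(e_values, default=None)


# Hypothetical family: expanded in group A relative to group B.
lengths = {"A1": 900, "A2": 1100, "B1": 100, "B2": 200, "B3": 300}
groups = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "B3": "B"}
cv, z_max, e_max = variability_metrics(lengths, groups)
```

In this toy family the mean inside group A is 1000 against an overall mean of 520, and the species outside group A have a mean of 200 and a standard deviation of 100, so Z for group A is 4.8 and E is 5.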

We wanted to filter the list of 10,986 families of ≥10 (‘tier 1’) genes to identify those families that had the most striking patterns of variability. Therefore, we took the union of the top 500 highest-scoring families according to each of the five metrics (Cv, Zmax and Emax for the narrower species groups, and Zmax and Emax for the broader species groups), yielding 1,248 families (Supplementary Table 9).

14 Species Tree

A whole-genome phylogeny for helminths was constructed from single-copy gene families. Briefly, we used the Compara database described above to identify a set of 202 protein-coding gene families that were present in at least 25% of the 91 species (81 helminths and 10 outgroups) in the database and single-copy in every species they were found in. The predicted amino acid sequences for these were aligned independently for each family using MAFFT v7.205 147, with the --auto flag to allow the software to choose the most appropriate alignment algorithm for the data. These alignments were then trimmed to remove ambiguous regions using GBlocks v0.91b 148, set to be relatively conservative in the level of trimming performed (parameters -b4=4 -b3=4 -b5=h). For each trimmed alignment, the likelihood of the alignment was calculated on a maximum-parsimony guide tree topology for all of the relatively simple (single-matrix) amino acid substitution models available in RAxML v8.0.24 150 and the best-fitting model (minimum Akaike Information Criterion; minAIC) was identified. The alignments were then concatenated and the maximum-likelihood topology found under a partitioned model in which sites from each gene were assigned the minAIC model for that gene, with a discrete gamma distribution of rates across sites. The phylogeny shown here was the best (highest likelihood) from 5 independent heuristic search replicates with different arbitrarily chosen random number seeds, although all 5 searches found identical tree topologies with very similar reported likelihoods. 100 bootstrap resampling replicates were performed to assess support for nodes on this phylogeny, with each based on a single rapid search. In every case, relationships within the outgroup lineages were constrained to match the now standard view of metazoan relationships (e.g. that found in ref. 157), as our sampling in that part of the tree was insufficient to correctly resolve all of those branches.

15 Novel domain combinations

All pairwise combinations of Pfam domains were identified in the predicted proteins of the 81 nematode and platyhelminth species. Excluding those also present in the predicted proteins from complete genomes of other phyla in UniProt (June 2016), 14,596 combinations of 4216 different Pfam domains were found in the nematodes and platyhelminths. Of these, 4435 combinations occurred in more than one species. We then looked for combinations specific to either nematodes or platyhelminths. We classified a protein domain combination as being specific to nematodes if it was present in more than 30% of nematode species and in no platyhelminths. Using this definition, we found 131 combinations specific to nematodes (Supplementary Table 14a). Using a similar approach for platyhelminths yielded 50 protein domain pairs specific to platyhelminths (Supplementary Table 14b).
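The search for clade-specific domain pairs can be sketched in Python as below. The function names and input format are assumptions for illustration, the single-letter domain labels are placeholders rather than Pfam accessions, and the preliminary exclusion of pairs found in other phyla is not shown.

```python
from collections import Counter
from itertools import combinations


def domain_pairs(proteins):
    """All unordered pairs of distinct domains that co-occur in a protein."""
    pairs = set()
    for domains in proteins:
        pairs.update(combinations(sorted(set(domains)), 2))
    return pairs


def clade_specific_pairs(pairs_by_species, clade, other, min_fraction=0.3):
    """Pairs found in more than `min_fraction` of `clade` species and in no `other` species."""
    counts = Counter()
    for species in clade:
        counts.update(pairs_by_species[species])
    excluded = set().union(*(pairs_by_species[species] for species in other))
    return {
        pair
        for pair, n in counts.items()
        if n / len(clade) > min_fraction and pair not in excluded
    }


# Hypothetical proteomes: each protein is a list of domain labels.
proteomes = {
    "nematode_1": [["A", "B"], ["C"]],
    "nematode_2": [["B", "A", "D"]],
    "nematode_3": [["C", "D"]],
    "flatworm_1": [["A", "D"]],
}
pairs_by_species = {sp: domain_pairs(p) for sp, p in proteomes.items()}
specific = clade_specific_pairs(
    pairs_by_species,
    clade=["nematode_1", "nematode_2", "nematode_3"],
    other=["flatworm_1"],
)
```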

16 Ion Channels and ABC Transporters

Ion channels and ABC transporters were analysed using two approaches. First, proteins representing members of the ABC transporter and cys-loop receptor subunits from C. elegans were collated from WormBase134. Predicted proteins from other species, homologous to these, were identified and assigned a closest homolog by reciprocal BLASTP. The predicted ion channels and ABC transporters identified using this approach are given in Supplementary Table 16 and Supplementary Table 17, respectively. These tables were used to produce the heatmaps of ion channels and ABC transporters, respectively. To build a phylogenetic tree of ion channels, a slightly different data set was used. Known cys-loop receptor accession IDs from C. elegans158, Brugia malayi159, Haemonchus contortus160, Oesophagostomum dentatum160, and Schistosoma mansoni58 were gathered and used to parse families from the database of Ensembl Compara families at WormBase ParaSite135. The outgroup (human and fruitfly) sequences were taken from UniProt. HMMTOP was used to predict transmembrane (TM) domains, and genes with fewer than three or more than eight TM domains were removed from further analyses. The resulting 1347 genes were aligned with MAFFT147, trimmed with trimAl155, and the alignment was manually edited so that poorly aligning sequences were removed. Of the 1347 genes, 278 were removed because they were fragments or aligned poorly, leaving 1069 genes (Supplementary Table 16b). The phylogeny was inferred with MrBayes 3.2 161. Posterior probabilities were calculated from eight reversible jump MCMC chains over 20,000,000 generations. Nodes with high posterior probability were used to define subfamilies so that each gene was in a single family. The tree was visualised with ggtree162, which is implemented in R.

17 Proteases

To find peptidase and peptidase inhibitor homologues, each helminth proteome was submitted to the MEROPS batch-BLAST163. This service returned a list of sequence accessions from the submitted library that were significantly similar (E value < 0.001) to a sequence in the merops_scan.lib sequence library. PfamScan, which compares each sequence to the Pfam HMM models and identifies homologous domains164, was used to find domains in some proteomes. Although subfamilies were examined manually and discussed in the results, comprehensive subfamily reporting was difficult to achieve because not all sequences fitted into existing MEROPS subfamilies. Thus, Supplementary Table 11 reports proteases at the MEROPS family level.

18 Kinase prediction

Kinase domain models were downloaded from the Kinomer website (http://www.compbio.dundee.ac.uk/kinomer/allPK.hmm). Custom score thresholds per kinase class were taken from the Kinomer paper165 and then adjusted until an hmmpfam search (HMMER v2.3.2) came as close as possible to identifying all known C. elegans kinases using the Kinomer allPK.hmm profile database. The final cutoffs used were: TK, 5.5e-03; CAMK, 9.6e-07; CK1, 1.1e-02; CMGC, 6.7e-03; AGC, 1.1e-14; STE, 3.4e-03; RGC, 4.8e-05; TKL, 8.7e-03; PDHK, 4.7e-160; PIKK, 1.4e-06; Alpha, 8.5e-66; and RIO, 7.5e-10. These cutoffs were used as a filter for the hmmpfam mapping of the proteins from the 81 nematode and platyhelminth species against the Kinomer models. Genes with hmmpfam 'hits' meeting these cutoffs were classified as a putative kinase of the given class. The kinase annotations are given in Supplementary Table 25.
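The per-class filtering can be illustrated with the Python sketch below, which uses the cutoffs listed above. Two points are assumptions rather than statements from the text: the cutoffs are treated as upper bounds on the hmmpfam E-value, and a protein passing the cutoff for more than one class is assigned to the class with the lowest E-value. The function name and input format are hypothetical, and parsing of hmmpfam output is not shown.

```python
# Per-class cutoffs as listed in the text (assumed here to be E-value upper bounds).
CUTOFFS = {
    "TK": 5.5e-03, "CAMK": 9.6e-07, "CK1": 1.1e-02, "CMGC": 6.7e-03,
    "AGC": 1.1e-14, "STE": 3.4e-03, "RGC": 4.8e-05, "TKL": 8.7e-03,
    "PDHK": 4.7e-160, "PIKK": 1.4e-06, "Alpha": 8.5e-66, "RIO": 7.5e-10,
}


def classify_kinase(hits, cutoffs=CUTOFFS):
    """hits: iterable of (kinase_class, evalue) pairs for one protein.

    Returns the class of the best hit that meets its class-specific cutoff,
    or None if no hit passes.
    """
    passing = [
        (evalue, kinase_class)
        for kinase_class, evalue in hits
        if kinase_class in cutoffs and evalue <= cutoffs[kinase_class]
    ]
    return min(passing)[1] if passing else None
```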

19 Heatmaps

Heatmap figures displaying gene abundance values (e.g. for kinases, GPCRs, ion channels) were constructed based on gene counts (available in supplementary tables) that were normalised by dividing by the total gene count for each species. Colouring for the visualisation was performed using ‘Conditional Formatting’ with MS Excel.

20 Signal peptide for secretion and TM domains predictions

Signal peptides and transmembrane domains were identified using Phobius166 version 1.01 and SecretomeP167 version 1.0 (32-bit version). Any protein found by Phobius to have a transmembrane domain was categorized as a membrane-bound protein. Proteins not found to be membrane-bound were classified as classically secreted if a signal peptide was detected by Phobius within 70 amino acids of the start of the protein. Proteins not meeting these conditions that were detected as signal peptides by SecretomeP were labelled as non-classically secreted proteins (Supplementary Table 7).
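The order in which these rules are applied can be summarised in a small Python sketch. The function and argument names are hypothetical, the label ‘other’ for unclassified proteins is an assumption, and ‘within 70 amino acids of the start’ is interpreted here as the predicted signal peptide ending at or before position 70.

```python
def classify_localisation(has_tm_domain, signal_peptide_end, secretomep_positive,
                          max_signal_position=70):
    """Apply the three rules in order of precedence.

    has_tm_domain:       True if Phobius predicts a transmembrane domain
    signal_peptide_end:  end position of the Phobius signal peptide, or None
    secretomep_positive: True if SecretomeP flags the protein
    """
    if has_tm_domain:
        return "membrane-bound"
    if signal_peptide_end is not None and signal_peptide_end <= max_signal_position:
        return "classically secreted"
    if secretomep_positive:
        return "non-classically secreted"
    return "other"
```

Because the transmembrane rule is checked first, a protein with both a transmembrane domain and a signal peptide is labelled membrane-bound.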

21 InterPro and GO annotations

InterProScan168 version 5.0.7 was run with default parameters on the protein sequences of all the predicted genes to annotate genes with InterPro (IPR) identifiers and Gene Ontology (GO) classifications.

22 Species-level functional enrichment (GO / InterPro / Pfam) analysis

The number of proteins annotated with each GO term, InterPro domain or Pfam domain was counted per species, and then counts were normalised based on the total count of GO/InterPro/Pfam annotations within each species. Mann-Whitney U tests (rank sum tests) were run using the normalised protein abundance for each annotation, using sets of target species, and all other species as the background. Mann-Whitney U test P-values were corrected for each set of target species using FDR adjustment, using only annotations identified at least 10 times, in at least 5 species. Log2 fold-change was calculated by adding 1 × 10^-8 to both values, in order to correct for zero values (Supplementary Table 26).
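The multiple-testing correction and the pseudocount-based fold-change can be sketched in Python as follows. The text specifies only ‘FDR adjustment’, so the use of the Benjamini-Hochberg procedure here is an assumption, as are the function names; the Mann-Whitney U test itself is not reimplemented.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def log2_fold_change(target, background, pseudocount=1e-8):
    """Log2 ratio of normalised abundances, with a pseudocount to handle zeros."""
    return math.log2((target + pseudocount) / (background + pseudocount))
```

The pseudocount keeps the ratio defined when an annotation is absent from either the target or the background species.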

23 SCP/TAPS protein family

SCP/TAPS proteins included in the phylogenetic analysis were identified from our predicted proteome data with the following criteria to filter sequences: (1) a Pfam domain PF00188, or being included in any SCP/CAP family in our Compara database; and (2) a minimum length of 146 aa, which is the length of the shortest C. elegans SCP/TAPS gene, and maximum length of 1,000 aa. After filtering, 3,167 proteins were identified from 83 species, comprising the 81 platyhelminths and nematodes, with Homo sapiens and D. melanogaster as outgroups. The 3,167 protein sequences were first separated according to their species group (Supplementary Table 3), and clusters were then detected among the sequences from each species group by using USEARCH140 (UCLUST algorithm, amino acid identity cutoff=0.70). Out of the 3,167 proteins (including 3,121 from helminths and 46 from outgroups), 1,456 proteins were placed into 498 clusters. Consensus sequences were generated for each of these clusters, removing ambiguous alignment columns. When combined with the 1,711 singleton protein sequences, a final dataset of 2,209 (=1,711+498) sequences was generated (Supplementary Table 10). The 2,209 sequences were aligned with MAFFT147 (v7.271, --localpair --maxiterate 2 --retree 1 --bl 45). To improve the alignment quality, we first constructed a draft phylogenetic tree with FastTreeMP169 (version 2.1.7 SSE3, -wag -gamma), and then used this tree as a guide tree to re-align the sequences with MAFFT (--localpair --maxiterate 2 --retree 1 --bl 45 --treein). The refined alignment was then slightly trimmed with trimAl155 (-gt 0.006), and used to compute a maximum likelihood phylogeny with FastTreeMP (version 2.1.7 SSE3, -wag -gamma).

24 GPCR analysis

Annotated GPCRs from C. elegans, B. malayi, O. volvulus, S. mansoni and Schmidtea mediterranea (providing information from both parasitic and free-living flatworms/roundworms) were identified from literature mining and previous GO annotations (all manually and computationally assigned annotations to GO term ‘G protein-coupled receptor activity’ (GO:0004930) in WormBase134). This list included almost 2,000 accessions: 1,339 from C. elegans, 59 from B. malayi, 85 from O. volvulus, 104 from S. mansoni, and 260 from S. mediterranea (Supplementary Table 15e). These were used as seeds for extraction of GPCR families created by the Compara pipeline (Supplementary Information: Results 2.1). We identified 346 Compara families that contained at least one seed (Supplementary Table 15c). A cursory examination of the 346 putative GPCR families revealed a substantial number of false-positives, likely due to imprecise GO annotations in the curated seeds. Taking the 346 families, a family-centric algorithm was used to filter out false-positives and assign GPCR class information. The sequence for each family member was extracted, families were aligned with MAFFT147 (mafft --auto), and alignments were trimmed with trimAl155 v1.4 (trimal -automated1) to remove uninformative sites. Finally, HHSuite170 was used to build an HMM for each family, and we used the HMM to search (hhsearch) against databases of HMMs included in the HHSuite package (HMMs from, or based upon, Uniprot, , Pfam, and PDB). This strategy focused on the conserved element of each family, which was likely the element that initially drove Compara clustering. The output from this approach was best-hit information for each family against each of the four databases. Families supported by hits to at least two databases were deemed to be actual GPCR families. Serpentine and 7TM Pfam domains were present in many proteins that appeared as false-positives when searching for GPCRs, which is why multiple lines of evidence, rather than just one, were used to identify likely GPCRs.

To ensure proper designation and to assign classification, a random member from each putative GPCR family was used in a blastp search against the non-redundant protein database at the NCBI. All of the best-hit information was used to assign each family to one of the GPCRdb171 classes (Class A, B, C, and F). A total of 200 families were identified in this way. In addition, synapomorphic gene families (Supplementary Information: Methods 9) annotated as GPCRs were found for platyhelminths (5 families), Neodermata (7), trematodes (2), cestodes (1), clade IVa (12), Strongyloidoidae (8), Strongylomorpha (1), nematodes (2), clade I (2), clade IIIa (1), (2) and Filarioidea (1) (44 families in total; Supplementary Table 8). These were manually examined, and most were judged to be valid GPCRs. Furthermore, two additional class B, or adhesion/secretin-like, GPCR Compara families were identified manually. Thus, in total we identified 230 GPCR families, which included 5,939 genes from our 33 ‘tier 1’ (with high-quality assemblies; Supplementary Information: Methods 7) platyhelminth and nematode species (Supplementary Table 15a, 15b, 15d). Note that several previous analyses of helminth genomes (e.g. Jex et al.172) have used GPCRSARfari in ChEMBL173 to identify putative GPCRs. However, GPCRSARfari is heavily biased towards class A, human receptors, and so would be unlikely to identify some platyhelminth or nematode-specific GPCRs such as the platyhelminth rhodopsin-like orphan family (PROF)174. As a result, we preferred to use our Compara database of gene families as a starting point for identifying putative GPCRs.

25 Metabolism

Assigning ECs to predicted proteins and generating high-confidence EC predictions

The predicted proteins of all 81 species were first screened for sequence similarity to proteins with associated EC (Enzyme Commission) numbers in curated databases: (i) DETECT175 v2.0 (cutoff ILS ≥ 0.9, ≥ 5 positive hits); (ii) PRIAM176, Feb-2014 (minimum probability >0.5, profile coverage >70%, check catalytic - TRUE); (iii) KAAS (KEGG Automatic Annotation Server)177; a locally installed version 2 of KAAS was used with default settings (i.e. bi-directional best hit with bit-score threshold of 35). The KOs (KEGG Orthologs) annotated by KAAS were associated with corresponding ECs using the KO and EC definitions in KEGGv70; and (iv) BRENDA178. From these assignments, a set of consolidated high-confidence predictions was derived by collating predictions from BRENDA (which are curated from literature) and DETECT (a robust prediction method which accounts for sequence diversity across enzyme families), and including predictions identified by both PRIAM and KAAS (Supplementary Table 18a). Gene families obtained via the Compara workflow were used in some cases to confirm absence of ECs related to loss of certain metabolic pathways highlighted in the main results. For this purpose, low-confidence ECs were assigned to those unannotated genes that were in a Compara family which included at least two genes from tier 1 species (species with high-quality assemblies; Supplementary Information: Methods 7) that were assigned a particular single EC, and no other gene from tier 1 species in the family that was assigned any other EC (Supplementary Table 18b). These lower-confidence ECs were not analysed further and were only used to reduce potential false negatives among the interesting metabolic findings reported.
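One way to read the consolidation rule is as a set operation per protein: every EC from BRENDA or DETECT is kept, and an EC from PRIAM or KAAS is kept only when both tools agree. The sketch below encodes that reading. The function name and the example EC numbers are illustrative, and the reading itself is an interpretation of the text rather than the authors' code.

```python
def high_confidence_ecs(brenda, detect, priam, kaas):
    """Combine per-protein EC predictions from the four sources.

    Each argument is an iterable of EC number strings for one protein.
    ECs from BRENDA or DETECT are accepted on their own; ECs from PRIAM
    and KAAS are accepted only when both tools predict them.
    """
    return set(brenda) | set(detect) | (set(priam) & set(kaas))


# Illustrative input only: "2.7.1.1" is supported by PRIAM and KAAS together,
# "3.1.3.9" by PRIAM alone, so only the former is retained.
example = high_confidence_ecs(
    brenda=[], detect=["1.1.1.1"], priam=["2.7.1.1", "3.1.3.9"], kaas=["2.7.1.1"]
)
```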

Reconstructing metabolic pathways and pathway hole-filling

To reconstruct the metabolic pathways for each of the 33 tier 1 species, we applied the Pathway Tools pipeline (v18.5) to the set of high-confidence EC predictions obtained above (Supplementary Table 18a). Pathway Tools relies on definitions of metabolic pathways from the BioCyc database179 and predicts the set of metabolic pathways likely to be present in the organism, based on the input set of EC annotations. This analysis was used for predicting vitamin and amino acid auxotrophies. Briefly, the algorithm uses a set of rules to assign evidence scores for pathway predictions based on: presence of most of the ECs for a pathway, presence of unique ECs, presence of the first two steps (for a degradation pathway), presence of the last two steps (for a biosynthetic pathway), and presence of >50% of enzymes (for energy metabolism pathways). It also uses taxonomic pruning, wherever information is available, to reduce false-positives. Pathway Tools also uses BLAST searches along with gene neighbourhood information to assign genes to pathway holes. KEGG defines ‘reference pathways’ for a limited set of species; only the KEGG pathways that had at least one reference pathway for a nematode/platyhelminth species in the KEGG database were included in our analysis. This meant excluding pathways such as ‘Carbon fixation in photosynthetic organisms’, which were deemed irrelevant to helminths, even though some of the enzymes implicated in these pathways were found in helminths. In addition, we excluded caffeine metabolism, which had only a single EC (out of a total of 14) annotated in C. elegans and C. briggsae (the only two nematodes with a reference pathway defined by KEGG for this), and was deemed unlikely to be of relevance to most helminths in this study. The final 65 KEGG pathways deemed to be ‘helminth-relevant’ are indicated in Supplementary Table 18e.
For these helminth-relevant pathways, annotations were expanded further by using the pathway hole-filler component (default settings) of Pathway tools to assign genes to pathway holes (Supplementary Table 18a). Only those predictions to genes not assigned ECs by any of the methods or those supported by BLASTP hits of E-value of ≤1e-10 (against SWISSPROT enzymes) and EFICAz v2.5180 (default settings) were used to augment the high-confidence set of predictions for the organism. Since hole-filling was only performed for the 33 tier 1 species, all metabolism analyses were performed on the EC annotations including pathway hole-filling, except the comparisons based on all 81 species for reasons of consistency (i.e. Extended Data Fig. 6a and Supplementary Table 20). The ECs inferred using Compara were only used for verification of pathway loss (e.g. module completion in Fig. 5a). Pathways with striking lineage specific differences were manually reconciled against the scientific literature. Where unexpected gaps in pathway coverage were found, candidate sequences from other organisms were used in sequence similarity searches to identify potential missing enzymes. Manually curated candidates for pathway holes are listed in the tables cited within the respective results sections below.

Analysis of KEGG metabolic modules and pathways

Predicting presence of KEGG modules

We also examined the presence/absence of KEGG metabolic ‘modules’ using the modDFS algorithm181. In brief, the algorithm starts from the final product of the module and systematically traverses all those nodes which can produce this product by a chain of substrate-product relations. For the 33 tier 1 species, the full set of detected ECs, including those identified using pathway hole-filling, was used; for the remaining species, analyses were based on the high-confidence EC annotations (Supplementary Table 18a). Species clustering based on presence/absence of modules was performed using Ward-linkage based on the Jaccard similarity index182.

Coverage comparisons among KEGG metabolic pathways

For every helminth-relevant KEGG metabolic pathway (as defined above in the methods for Reconstructing metabolic pathways and pathway hole-filling), the coverage was compared separately among different groups of worms. These groups were either phylum level (platyhelminths and nematodes) or subsets thereof (cestodes and trematodes for platyhelminths, and parasites from different clades of nematodes). Including just the parasites meant defining groups like ‘Clade IVa-’ (all Clade IVa worms except the free-living Rhabditophanes) and ‘Clade V-’ (all Clade V worms except the free-living C. elegans and P. pacificus). All the group definitions are given in Supplementary Fig. 12. The coverage for a pathway in a particular species was calculated as (number of ECs of the pathway annotated in that species x 100) / (total number of ECs for that pathway in KEGG). The ECs from hole-filling were included when calculating coverage. The comparisons were performed using Wilcoxon tests, and FDR-corrected P-values (corrected using the Benjamini-Hochberg procedure) were used to assign significance (P<0.05). The coefficient of variation of pathway coverage was used to measure the variation in coverage of these pathways in different worm groups. Comparisons were also performed across all worms between different ‘superpathways’ (e.g. combining all ‘amino acid metabolism’ pathways together). For these comparisons, a Wilcoxon test was performed over the distribution of the coefficient of variation.
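The coverage formula and the coefficient of variation can be written out as follows. This is a sketch with hypothetical function names. Whether the sample or the population standard deviation was used is not stated in the text, so the sample form below is an assumption.

```python
import statistics

def pathway_coverage(annotated_ecs, pathway_ecs):
    """Percentage of a pathway's KEGG ECs that are annotated in one species."""
    pathway_ecs = set(pathway_ecs)
    if not pathway_ecs:
        raise ValueError("pathway has no ECs")
    found = set(annotated_ecs) & pathway_ecs
    return 100.0 * len(found) / len(pathway_ecs)

def coefficient_of_variation(coverages):
    """Standard deviation divided by the mean, across the species of a group.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    mean = statistics.mean(coverages)
    if mean == 0:
        raise ValueError("mean coverage is zero")
    return statistics.stdev(coverages) / mean
```

For example, a species annotated with two of the four ECs of a pathway has a coverage of 50 for that pathway.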

Analysis of chokepoints in metabolic pathways

A ‘chokepoint reaction’ is defined as a reaction that either consumes a unique substrate or produces a unique product. The chokepoint enzymes were identified according to Taylor et al.183, with the following modification: the metabolic networks analysed were not the entire reference reaction sets in KEGG, but only the subnetworks formed by the reactions annotated in the species of interest (including ECs from hole-filling), resulting in more organism-specific metabolic networks. Chokepoints are reported in the context of these species-specific networks. Clustering of species based on detected chokepoint enzymes was performed using the Jaccard similarity index and Ward-linkage method. Across species, we identified between 50% (O. volvulus) and 61% (F. hepatica) of enzymes to be chokepoints. Since the number and identity of chokepoints is expected to be sensitive to the accuracy and completeness of the inferred metabolic network, we evaluated this sensitivity by comparing the set of chokepoints before and after pathway hole-filling in the 33 tier 1 species. In general, the number of predicted chokepoints changed only slightly, with an average of 1.7% of chokepoints removed after hole-filling and an average of 7.8% gained (Supplementary Fig. 18a). Nematodes had on average 245 chokepoints and flatworms 197 chokepoints (Supplementary Fig. 18a). Clustering the species based on presence of chokepoints showed a pattern that was clearly different between the phyla, and led to better phylum separation (Supplementary Fig. 18b) than clustering based on the entire metabolic network (Supplementary Fig. 19e). Comparing the identified chokepoints, ~18% (84/474) and ~19% (89/474) were conserved in all nematodes and platyhelminths, whereas ~11% (53/474) and ~30% (143/474) were missing from each phylum, respectively (Supplementary Fig. 18c). Of 143 nematode chokepoints previously identified based on a comparative analysis of 10 nematode species183, 111 were confirmed in the present study and 90 of these were also conserved across trematodes.
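The chokepoint definition above can be illustrated on a toy network. The sketch below is not the published procedure of Taylor et al. It only encodes the stated definition: a reaction is a chokepoint if it is the sole consumer of some substrate or the sole producer of some product within the species-specific network. The reaction and compound identifiers are made up.

```python
from collections import defaultdict

def find_chokepoints(reactions):
    """Return the IDs of chokepoint reactions.

    reactions -- dict mapping reaction ID to (substrates, products),
                 restricted to the reactions annotated in one species.
    """
    consumers = defaultdict(set)  # compound -> reactions that consume it
    producers = defaultdict(set)  # compound -> reactions that produce it
    for rid, (substrates, products) in reactions.items():
        for compound in substrates:
            consumers[compound].add(rid)
        for compound in products:
            producers[compound].add(rid)

    chokepoints = set()
    for rid, (substrates, products) in reactions.items():
        sole_consumer = any(len(consumers[c]) == 1 for c in substrates)
        sole_producer = any(len(producers[c]) == 1 for c in products)
        if sole_consumer or sole_producer:
            chokepoints.add(rid)
    return chokepoints


# Toy network: R1 and R2 are redundant (A -> B), R3 is the only route B -> C.
toy = {"R1": ({"A"}, {"B"}), "R2": ({"A"}, {"B"}), "R3": ({"B"}, {"C"})}
```

Restricting the input to a species' own reactions is what makes the result organism-specific: removing R2 from the toy network would turn R1 into a chokepoint.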

Carbohydrate active enzymes (CAZymes)

The carbohydrate metabolism enzymes (CAZymes) were detected using HMMER3 hits (with E-value thresholds of 1e-13 for nematode and platyhelminth proteins of >80aa, and 1e-9 for everything else) against the dbCAN database184, from gene sets of all the species. A phylogenetic tree of the genes in each CAZyme family was generated, and if necessary further categorisation was performed based on the tree. Some CAZyme classes had few hits (e.g. GT5, GT51), and were assumed to be bacterial contamination, so these were removed from the analysis (Supplementary Table 27).

26 Identification of Potential Anthelmintic Drug Targets and Drugs

Known anthelmintic drugs and compounds

A list of known anthelmintic drugs (human and veterinary) and compounds, including nematicidal compounds (Supplementary Table 21a), was collated from:
(i) WHO with ATC (Anatomical Therapeutic Chemical) code P02 (anthelmintics);
(ii) WHO with ATCvet code QP52 (anthelmintics) or QP54 ( and ). Note that halodone is listed as a veterinary anthelmintic, but we did not find any other evidence of that, so have excluded it;
(iii) Listed with anthelmintic activity in ‘The use of stems in the selection of International Nonproprietary Names (INN) for pharmaceutical substances’ (WHO, 2013, http://www.who.int/medicines/services/inn/stembook/en/);
(iv) Listed as anthelmintic in the Merck Medical Manual or Merck Veterinary Manual (http://www.merckmanuals.com/);
(v) Anthelmintic drugs or compounds for human or veterinary use reported in the scientific literature68,185-193;
(vi) Tagged with MeSH categories ‘anthelmintic’, ‘anticestodal’ or ‘antinematodal’ in PubChem194;
(vii) Listed as having ‘anthelmintic’, ‘anticestodal’, ‘antischistosomal’, ‘antitrematodal’, or ‘fasciolicide’ activity in KEGG Drugs195;
(viii) Listed as ‘anthelmintic’ in the ChEBI database196;
(ix) Listed in ChEMBL197 with the keywords ‘anthelmintic’ OR ‘anthelminthic’ in any of the following fields: ATC code description, mechanism of action, USAN (United States Adopted Name) stem definition, or indication class;
(x) Tagged with MeSH categories ‘anthelmintics’, ‘antinematodal’, ‘filaricides’, ‘antiplatyhelmintic’ or ‘schistosomicides’ in DrugBank198, or identified by a text search for ‘anthelmintic OR antihelminthic OR antinematodal OR antitrematodal’ in DrugBank;
(xi) 219 compounds from PubChem identified by searching for ‘anthelmintics OR anthelmintic OR nematocide OR nematicide’, and with low structural similarity to the list of compounds from (i)–(x) above (similarity ≤0.4 when calculated using the Tanimoto coefficient of ECFP4 fingerprints as implemented in the RDKit toolkit (http://www.rdkit.org)).
The OPSIN software was used to convert IUPAC names to SMILES strings199 that were used to search PubChem and ChEMBL for specific compounds. The resulting list of anthelmintic drugs (human and veterinary) and compounds (Supplementary Table 21a) includes 261 compounds.

Dendrogram of known anthelmintic compounds

To calculate a dendrogram of known anthelmintic compounds, the SMILES strings for 255 known anthelmintic compounds were taken from Supplementary Table 21a. For seven of the compounds, multiple SMILES were used (for example, because of different stereoisomers; see notes in Supplementary Table 21a column I), increasing the total to 263 SMILES strings. The pairwise similarity between the 263 SMILES strings in this dataset was calculated based on ECFP4 fingerprints of length 16384 as implemented in RDKit 2016.09.4 (http://rdkit.org/). The Tanimoto similarity between pairs of fingerprints was calculated using ChemFP 1.1200. The similarities ranged from 0.0 to 1.0, and were converted to distances using 1.0 - similarity. The distance matrix was read into R, and hierarchical clustering was performed using Ward's minimum variance method (using ‘ward.D’ in the ‘hclust’ function). The dendrogram was cut at a height which separated the well-known chemical classes of anthelmintic compounds (avermectins, milbemycins, and imidazothiazoles) from other compounds. If the members of a cluster defined in this way had little in common with respect to chemical structure, the cluster was further split up. This resulted in 44 chemical classes.
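The similarity-to-distance step can be sketched without any cheminformatics library by representing each fingerprint as the set of its "on" bit positions. This is a simplification for illustration: in the analysis the fingerprints came from RDKit and the similarities from ChemFP, and the bit sets below are invented.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

def distance_matrix(fingerprints):
    """Pairwise distances, defined as 1.0 - Tanimoto similarity (as in the text)."""
    n = len(fingerprints)
    return [
        [0.0 if i == j else 1.0 - tanimoto(fingerprints[i], fingerprints[j])
         for j in range(n)]
        for i in range(n)
    ]


# Invented on-bit sets standing in for ECFP4 fingerprints.
fps = [{1, 2, 3}, {2, 3, 4}, {7}]
dist = distance_matrix(fps)
```

A square matrix of this form is what would then be passed to a hierarchical clustering routine such as R's `hclust`.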

Identifying potential helminth drug targets

To identify potential drug targets in ChEMBL197, all 528,469 helminth proteins from tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7) were searched against the 6,261 single-protein targets from the ChEMBL 21 database (1,592,191 compounds and 11,019 targets) using BLASTP. ChEMBL includes both targets of known drugs and of other biologically active compounds. Only the top hit (with E-value ≤ 1e-10) was considered for each putative helminth protein (including isoforms), resulting in 106,278 helminth hits to 3,994 single-protein ChEMBL targets.

Target properties considered were those that would be most attractive in a potential new drug target, including:
• Similarity to a known drug target in any species;
• Lack of a human homologue, to avoid toxicity issues in humans;
• Whether C. elegans or D. melanogaster homologs have lethal or sterile phenotypes when disrupted.

Thus, each of the 106,278 helminth genes was assigned an overall target score based on the sum of the scores below (Extended Data Fig. 7), weighted as follows:

(50 x BLAST score) + (40 x ChEMBL non-human score) + (30 x phenotype score) + (15 x chokepoint score) + (10 x multi-species score) + (10 x expression score) + (10 x species-distribution score) + (10 x invertebrate-biology score) + (5 x singleton score) + (5 x PDB score) + (5 x alignment-conservation score)
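The weighted sum above can be written as a small function. The weights are taken directly from the formula, while the dictionary keys and the function name are hypothetical labels chosen for this sketch.

```python
# Weights from the target-score formula; the key names are illustrative labels.
TARGET_WEIGHTS = {
    "blast": 50,
    "chembl_non_human": 40,
    "phenotype": 30,
    "chokepoint": 15,
    "multi_species": 10,
    "expression": 10,
    "species_distribution": 10,
    "invertebrate_biology": 10,
    "singleton": 5,
    "pdb": 5,
    "alignment_conservation": 5,
}

def target_score(components):
    """Weighted sum of component scores; missing components count as zero."""
    unknown = set(components) - set(TARGET_WEIGHTS)
    if unknown:
        raise KeyError(f"unknown score components: {sorted(unknown)}")
    return sum(w * components.get(name, 0.0) for name, w in TARGET_WEIGHTS.items())
```

With every component at its maximum of 1, the score is 190, so the BLAST and ChEMBL non-human terms together account for nearly half of the attainable total.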

(i) BLAST score. This was set to 1 for high sequence similarity hits to ChEMBL (E ≤ 1e-85, and ≥80% of the residues in the ChEMBL protein covered by the BLAST match), and otherwise set to zero. The E-value threshold of 1e-85 was chosen as it was the 0.75-quantile of the E-values for all the BLAST hits to ChEMBL. Overall, 14.6% of the helminth genes had a BLAST score of 1.
(ii) ChEMBL non-human score. A score of 1 was applied to helminth proteins that only match non-human targets within ChEMBL, because for these targets, developing drug selectivity to avoid toxicity issues in humans should be easier. The human proteins in our Compara database were searched against ChEMBL 21 proteins using BLASTP. Where the top ChEMBL BLAST hit of a helminth gene itself had a match to human (with E ≤0.05), or belonged to a Compara family containing a human member, the helminth gene was assigned a score of zero. Overall, 0.3% of the helminth genes had a ChEMBL non-human score of 1.

(iii) Phenotype score. Large-scale mutant phenotype data are not available for parasitic worms, so essentiality was inferred from model organisms. A list of C. elegans genes with ‘lethal’, ‘sterile’, ‘L3 arrest’, ‘molt defect’, or ‘paralysed’ phenotypes (based on knockouts/knockdowns/variants) was downloaded from WormBase (July 2016). In addition, a list of D. melanogaster genes with ‘lethal’, ‘sterile’ or ‘paralytic’ (paralysed) phenotypes was downloaded from FlyBase201. Lethal phenotypes were weighted more heavily than other phenotypes. If a helminth gene belonged to a Compara family containing both C. elegans and Drosophila genes with lethal phenotypes, it was assigned a phenotype score of 1. Otherwise, nematode genes that belonged to families with C. elegans genes that had lethal phenotypes scored 0.9, and those in families with essential Drosophila (but not C. elegans) genes scored 0.8. Platyhelminth genes belonging to families with essential C. elegans or Drosophila genes scored 0.8. If a helminth gene belonged to a Compara family containing both C. elegans and Drosophila genes that both had ‘sterile’ phenotypes, or both had ‘paralysed’ phenotypes, it was assigned a score of 0.7. Otherwise, for a nematode gene, if it belonged to a family with a C. elegans gene with a paralysed/sterile/L3 arrest/molt defect phenotype, it was assigned 0.6; and if it belonged to a family with a Drosophila (but no C. elegans) gene of paralysed/sterile phenotype, it was assigned 0.5. A platyhelminth gene that belonged to a family with a C. elegans or Drosophila gene of paralysed/sterile/L3 arrest/molt defect phenotype was assigned 0.5.

The distribution of scores was: 1 (5.3%), 0.9 (21.9%), 0.8 (7.9%), 0.7 (0.5%), 0.6 (6.6%), and 0.5 (4.2%).
(iv) Chokepoint score. This was set to 1 for helminth genes predicted to encode chokepoint enzymes (see Analysis of chokepoints above) and belonging to Compara families with ≥3 predicted chokepoints. Because chokepoint predictions are not very accurate, predicted chokepoints that did not belong to such families were assigned a score of 0.1. Non-chokepoint enzymes were assigned a zero score. Score distribution: 1 (9.6%), 0.1 (0.4%), 0 (90%).
(v) Multi-species score. To penalise species-specific helminth genes that could include residual unfiltered contamination (e.g. from bacteria), this score was set to zero for helminth genes belonging to Compara families with only a single species, but otherwise set to 1. Overall, 99% of the helminth genes had multi-species scores of 1.
(vi) Expression score. For the majority of helminth species, including filarial species, drugs should target the adult stage. However, for some species, targeting other stages is important, for instance: the metacestode stage for cestodes, larvae of Trichuris and Strongyloides, and somules of Schistosoma. In the absence of expression data for many stages from many species, adult and metacestode (for cestodes) expression data were used, from the set of ‘reference’ species used for gene-finding. For each helminth gene, expression data from the most closely related ‘reference’ species were used: H. contortus for clade V; A. suum for IIIb (Ascarididomorpha); B. malayi and O. volvulus for IIIc (Spiruromorpha+other) and IIIa (Oxyuridomorpha); T. muris for clade I; S. ratti for clade IV; H. microstoma for cestodes except E. multilocularis for Taenia species; and S. mansoni for trematodes. If the expression level of any corresponding ‘reference’ gene was ≥ 5.0 RPKM in adults or metacestodes, an expression score of 1 was assigned; otherwise zero was used. Overall, 70% of the helminth genes had expression scores of 1.

RNASeq studies and samples of interest were obtained from the European Nucleotide Archive (http://www.ebi.ac.uk/ena) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress): H. contortus (run accession SRR928055, SRR928056), A. suum (SRR504556, SRR504557, SRR504558, SRR504559, SRR504560, SRR504561), B. malayi (ERR048961, ERR048962, ERR048970, ERR048972), O. volvulus (ERR225734, ERR485009), T. muris (ERR279677, ERR279678, ERR279676), S. ratti (ERR299169, ERR299175, ERR299168, ERR299174, ERR299170, ERR299176), H. microstoma (adult: ERR225730, ERR225728, ERR225729; metacestode: ERR337915, ERR337928, ERR337940, ERR337952, ERR337964, ERR337976), E. multilocularis (metacestode: ERR337932, ERR337944, ERR337956, ERR337968, ERR337906, ERR337919), and S. mansoni (ERR022873). The analysis of each sequencing run was performed by the EMBL-EBI Team using their iRAP pipeline and was downloaded from their RNA-seq Analysis API (http://www.ebi.ac.uk/fg/rnaseq/api/doc) (July 2016). This pipeline aligns quality-filtered reads to reference genomes from WormBase ParaSite135 using TopHat 2202, and then quantifies expression of genes and exons in the corresponding GTF file from WormBase ParaSite using HTSeq203 (intersection-non-empty mode) and DEXSeq204, respectively.
(vii) Species-distribution score. To give greater preference to potential helminth targets that are found in multiple species (so that the same drug may target multiple species), a score of 1 was assigned if a helminth gene was present in a Compara family containing ≥90% of the tier 1 species from a major group of helminths. The major groups were defined as (in order of preference): all nematodes and platyhelminths, just nematodes, just platyhelminths, cestodes, filaria, trematodes, or schistosomatids. Otherwise, it was assigned a score of 0. Overall, 90% of the nematode and platyhelminth genes had a species-distribution score of 1.
(viii) Invertebrate biology score. If the ChEMBL BLAST hit is from a closely related animal, it is more likely that the helminths have conserved biology; for example, helminths may share processes involved in moulting and life cycle control with . Therefore, if the top ChEMBL BLAST hit of a helminth gene was to a UniProt protein from a non-chordate metazoan, we assigned it an ‘invertebrate biology score’ of 1, otherwise 0. We did not assign a score of 1 for matches to chordate proteins, in order to downweight targets that may be shared with the vertebrate hosts of helminths. Overall, 3% of the helminth genes had an invertebrate biology score of 1.
(ix) Singleton score. Developing drugs against simple single-copy targets is likely to be easier than developing drugs for multigene families. The score was therefore set to 1 for helminth genes that lacked within-species paralogs (using our Compara database), and otherwise set to zero. Overall, 53% of the helminth genes had singleton scores of 1.
(x) PDB score. A score of 1 was assigned to helminth proteins that matched ChEMBL targets with available structures in the PDBe205. This was to reflect the possibility of structure-aided drug development. Overall, 51% of helminth genes had a PDB score of 1.
The list of 106,278 helminth proteins included many sets of related helminth proteins with BLAST matches to the same ChEMBL protein. The list was collapsed to just take the top-scoring helminth protein matching each ChEMBL protein, and 3,994 helminth proteins remained.
The list still contained a lot of redundancy, since many of the helminth proteins were from the same Compara family and matched homologous ChEMBL proteins (for example, from human, mouse, rat); after further collapsing to just the highest-scoring helminth protein from each Compara family, 1925 helminth proteins remained. For each of the 1925 helminth genes, we took the compounds with activities to its best ChEMBL BLAST hit and also the corresponding compounds for all helminth genes in the same Compara family (including those that had a different top BLAST hit in ChEMBL). (xi) Alignment conservation score. Drugs that act against multiple species are highly desirable and this is more likely to occur if the target is sufficiently conserved between helminth species. Across each column of the alignment, for each Compara family, a score was calculated using the approach of Capra & Singh 2007206 using the Jensen-Shannon divergence, with a window size of 3 (on either side of the residue) and the BLOSUM62 background distribution. The overall score for a family was taken as the median of the scores for all columns that had scores of >-1000. Taking all the helminth genes with BLAST matches (E ≤1e-10) to ChEMBL, the 0.75-quantile of the alignment scores for Compara families to which they belonged was 0.68. Therefore, if a helminth gene belonged to a family with an alignment score of ≥0.68, it was assigned an ‘alignment conservation score’ of 1, and otherwise 0. 5% of helminth genes had alignment conservation scores of 1.
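The tiered phenotype score in item (iii) is easiest to follow as an ordered set of checks. The sketch below is one possible reading of those rules, not the authors' implementation: the rule order follows the "otherwise" chain in the text, and the function and argument names are hypothetical.

```python
OTHER_CE_PHENOTYPES = {"sterile", "paralysed", "L3 arrest", "molt defect"}
OTHER_DM_PHENOTYPES = {"sterile", "paralysed"}

def phenotype_score(phylum, celegans_phenotypes, drosophila_phenotypes):
    """Score a helminth gene from the phenotypes of model-organism genes in its family.

    phylum -- "nematode" or "platyhelminth"
    celegans_phenotypes / drosophila_phenotypes -- phenotype labels found among
        the C. elegans / Drosophila genes of the same Compara family
    """
    ce, dm = set(celegans_phenotypes), set(drosophila_phenotypes)
    ce_lethal, dm_lethal = "lethal" in ce, "lethal" in dm

    if ce_lethal and dm_lethal:
        return 1.0
    if phylum == "nematode":
        if ce_lethal:
            return 0.9
        if dm_lethal:
            return 0.8
    elif ce_lethal or dm_lethal:
        return 0.8

    # Same non-lethal phenotype in both model organisms.
    if any(p in ce and p in dm for p in ("sterile", "paralysed")):
        return 0.7

    ce_other = bool(ce & OTHER_CE_PHENOTYPES)
    dm_other = bool(dm & OTHER_DM_PHENOTYPES)
    if phylum == "nematode":
        if ce_other:
            return 0.6
        if dm_other:
            return 0.5
    elif ce_other or dm_other:
        return 0.5
    return 0.0
```

The asymmetry between phyla reflects that C. elegans is itself a nematode, so its phenotypes are given more weight for nematode genes than for platyhelminth genes.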

43 Identifying potential new anthelmintic drugs in ChEMBL The ChEMBL database (version 21)197 was used to identify 827,889 compounds, which included approved drugs, compounds in clinical development and bioactive compounds from the medicinal chemistry literature that had activities against ChEMBL single-protein targets to which our helminth proteins had significant (E ≤ 1e-10) BLAST matches. To assess the likely suitability of each compound as a potential new anthelmintic drug, an overall ‘compound score’ (Extended Data Fig. 7) was generated. Compound properties considered were those that would be most advantageous for a neglected disease drug207 including: • Compounds that could be more quickly and cheaply developed into drugs: we prioritised compounds in high clinical development phases and those where a crystal structure was available to help inform molecule design; • Compound properties: by focussing on compounds with properties consistent with those of oral drugs (Quantitative Estimate of Drug-likeness) and lacking known toxicity issues using Black Box warning information and toxic effect predictors; • Preferred route of administration for the ideal anthelmintic: compounds with oral or topical administration were considered most desirable. The compound score was based on the sum of parameters (below) that were retrieved from the ChEMBL database and weighted as follows: (5 x QED) + (5 x Maximum development phase) + (5 x Route of administration) + (5 x Black box warning information) + (5 x Molecule structural information availability) + (2.5 x Toxicology target interaction prediction) (i) Quantitative Estimate of Drug-likeness (QED)208 was calculated in-house by ChEMBL. QED values range from 0-1, and the closer a QED value is to 1, the more oral-drug-like is the compound under consideration. QED values ranged from 0.01-0.95, the median value was 0.60, and the interquartile range was 0.43-0.74. (ii) Maximum development phase was recorded for each compound. 
The distribution of scores was: 0 (99.7%), I (0.02%), II (0.03%), III (0.05%), and IV (0.22%). (iii) Route of administration is based on information obtained from drug prescribing information packages. Compounds with oral or topical routes were given a score of 1. 0.17% of compounds had a route of administration score of 1. (iv) Black Box Warning information. The FDA provides Black Box Warnings for approved drugs where use of the drug is associated with serious or life threatening . For the compounds identified, a score of -1 was used to penalise any compounds with Black Box Warnings. 0.07% of compounds had a black box warning score of -1. (v) Molecule structural information availability. A score of 1 was assigned to compounds with at least one structure deposited in the in Europe (PDBe) (http://www.ebi.ac.uk/pdbe/). 0.76% of compounds had a molecule structural information availability score of 1.

(vi) Toxicology target interaction prediction. The ChEMBL toxicology target prediction pipeline, with prediction models created using activity information in the database collected as part of the HeCatos project (http://www.hecatos.eu), was used to predict compounds that are likely to interact with known toxicology targets. Such compounds were penalised with a score of -1. In total, 0.004% of compounds had toxicology target interaction scores of -1.
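The weighted sum above can be illustrated with a minimal Python sketch. The class and field names are hypothetical, and the sketch assumes that the maximum development phase enters the sum as an integer from 0 to 4 and that the remaining terms take the 0, 1 or -1 values described in (iii) to (vi). It is not the ChEMBL query or scoring code used for the analysis.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record holding the six parameters of the compound score."""
    qed: float                   # Quantitative Estimate of Drug-likeness, 0 to 1
    max_phase: int               # assumed to be 0 (none) to 4 (approved)
    oral_or_topical: bool        # route of administration is oral or topical
    black_box_warning: bool      # compound carries an FDA Black Box Warning
    has_pdbe_structure: bool     # at least one structure deposited in PDBe
    predicted_tox_target: bool   # predicted to interact with a toxicology target


def compound_score(c: Compound) -> float:
    """Weighted sum of the six parameters, using the weights given in the text."""
    route = 1 if c.oral_or_topical else 0
    warning = -1 if c.black_box_warning else 0
    structure = 1 if c.has_pdbe_structure else 0
    toxicology = -1 if c.predicted_tox_target else 0
    return (5 * c.qed + 5 * c.max_phase + 5 * route
            + 5 * warning + 5 * structure + 2.5 * toxicology)


# An approved oral drug with a crystal structure and no warnings.
approved = Compound(0.6, 4, True, False, True, False)
# A literature compound with a Black Box Warning and a predicted toxicology hit.
flagged = Compound(0.5, 0, False, True, False, True)
```

Under these assumptions the development phase dominates the score, so approved drugs rank well above unoptimised literature compounds with similar drug-likeness.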

Diversity analysis for creating a ‘diverse screening set’

Of our 289 (15%) highest-scoring helminth targets, 286 had compounds in ChEMBL (292,499 unique compounds). To filter these ChEMBL compound–target pairs, two additional parameters were retrieved from ChEMBL:

(i) pChEMBL is a parameter calculated by ChEMBL that provides a consistent measure of the affinity of a compound for its (ChEMBL) target and is defined as: −log10 (molar IC50, XC50, EC50, AC50, Ki, Kd or Potency). For example, a pChEMBL of 5 corresponds to an activity of 10 μM, so pChEMBL > 5 corresponds to an activity more potent than 10 μM.
(ii) Compound and target in a structure. Where possible, structures were retrieved from the Protein Data Bank in Europe (PDBe) (http://www.ebi.ac.uk/pdbe/) that contained both the compound and a ChEMBL target.

The 292,499 compounds were filtered by selecting compounds that (i) co-appeared in a PDBe205 structure with the ChEMBL target; or (ii) had a high pChEMBL score (median >5, reflecting high potency/affinity for the ChEMBL target). This filtering left 131,452 ‘top drug candidates’.

To create a ‘diverse screening set’, we performed a diversity analysis to classify the 131,452 ‘top drug candidates’ into chemical classes. The data set consisted of the 131,381 (of these 131,452) compounds for which a SMILES string was available in ChEMBL22_1 or ChEMBL16, along with the 263 SMILES strings for 255 known anthelmintic compounds used for the dendrogram of anthelmintic compounds (see above), giving 131,644 SMILES strings. ChEMBL16 was used to obtain SMILES strings for some metal-containing compounds that are lacking SMILES strings in ChEMBL22. For 71 of the ‘top drug candidate’ compounds (for example, polymers) there were no SMILES strings in ChEMBL, so these compounds were not included in the diversity analysis. The RDKit software (used to create fingerprints based on the SMILES) rejected five of the SMILES strings, leaving 131,376 SMILES for top drug candidates, and 263 for known anthelmintic compounds (131,639 total).

The pairwise similarity between the 131,639 SMILES strings was calculated, and the similarities converted to distances, as for the dendrogram of known anthelmintic compounds (see above). This number of compounds was too large to construct a dendrogram using hierarchical clustering.
Instead, by manually examining the clusters in the dendrogram of known anthelmintic compounds, we found that compounds in the same cluster generally had distances of ≤0.65 from each other, and those in different clusters usually had distances of >0.65. Thus, a set of clusters was found by first constructing a graph in which each compound was represented by a node, and each pair of nodes was joined if the distance between the two compounds was ≤0.65. The resulting graph consisted of 366 connected components containing 130,525 compounds, and 1,114 singleton compounds that were not placed in any connected component (or could be considered single-node components).
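The graph construction described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that each compound is represented by the set of 'on' bits of a fingerprint and that the distance is 1 minus the Tanimoto similarity, and the compound names and bit sets are invented toy data. Connected components are found with a simple union-find structure rather than with any particular graph library.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def threshold_components(fingerprints, max_distance=0.65):
    """Join compounds whose distance is <= max_distance; return connected components."""
    names = list(fingerprints)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the component root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-against-all comparison (quadratic; fine for a toy example).
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if tanimoto_distance(fingerprints[a], fingerprints[b]) <= max_distance:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    # Largest components first; ties broken alphabetically for reproducibility.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Toy fingerprints (hypothetical bit sets, not real compounds).
toy = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},   # distance 0.4 from a
    "c": {3, 5, 6},      # distance 0.6 from b, but about 0.83 from a
    "d": {10, 11},       # shares no bits with the others
}
components = threshold_components(toy)
```

In the toy data, "a" and "c" end up in the same component only because both are linked to "b". This chaining effect is how a single component can grow very large and contain dissimilar compounds.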

One of the 366 components was very large (128,794 compounds) and included relatively dissimilar compounds among the known anthelmintics. Therefore, we further split the connected components by using a community (cluster) detection algorithm to find clusters within the subgraph corresponding to each connected component. To find the communities, we used the ‘community’ Python module209 to find optimal communities in terms of the modularity measure (as described previously210), using the similarities between compounds as the edge weights. When the community detection algorithm was applied to the 366 connected components in the graph of 131,548 top drug candidates and anthelmintic compounds, it split the 366 components into 1894 communities (clusters).

Upon manual examination, it was found that some of these 1894 clusters still contained relatively dissimilar compounds among the known anthelmintic compounds. Therefore, for each cluster, we performed hierarchical clustering in R using Ward's minimum variance method, and cut the dendrogram at a height of 0.85. The value of 0.85 was chosen by trying various heights, and finding the height at which anthelmintic compounds known to have similar structures were placed in the same cluster. These clusters are quite tight: for example, the are split up into several smaller clusters. The 263 SMILES strings for 255 known anthelmintic compounds were classified into 193 relatively narrow chemical classes using this approach.

This procedure resulted in 26,811 clusters, of which 24,738 contained more than one compound, and the other 2,073 were singleton clusters. Therefore, including the 1,114 singleton clusters mentioned above, the 131,639 compounds (SMILES strings) were placed into 27,925 clusters: 24,738 clusters of >1 compound plus 3,187 singleton clusters.
Considering just our 131,376 ‘top candidate compounds’ from ChEMBL (for which we had SMILES strings), these were placed into 27,868 clusters: 24,670 clusters of >1 compound and 3,198 singleton clusters. Thus, we considered that the 131,376 ‘top candidate compounds’ belonged to 27,868 chemical classes. Assuming that the 76 compounds that could not be classified each belong to their own class yields a total of 27,944 classes.

At a later stage, the 131,452 ‘top drug candidates’ were further filtered by discarding the medicinal chemistry compounds that neither co-appeared in a PDBe structure with the ChEMBL target nor had a median pChEMBL score of >7, which left 52,154 top drug candidates. The same chemical classes identified above for the 131,452 compounds were still used for the smaller set of 52,154 candidates.

Identifying compounds available for purchase using ZINC15

ZINC 15211 was used to identify compounds available for purchase in the dataset. The subset of ZINC 15 with the highest level of availability (‘in stock’) was used. Identity-matching used the parent compound (a single largest component) and standard InChIs, first directly and then after removing the charge and atom-based stereochemistry layers. Of 131,376 ‘top candidate compounds’ searched for (the subset of the 131,452 ‘top candidate compounds’ for which we had SMILES strings), 26,573 (20.2%) were present as exact matches in ZINC, and 33,289 (25.3%) were present after removing charges and stereochemistry information (but not geometric isomers).

Self-organising map of compounds

To better understand the diversity of the 5046 compounds in the diverse screening set, we constructed a self-organising map (SOM) of these compounds and the 263 known anthelmintic compounds for which SMILES strings were available. The SOM was built using the R package Kohonen v3.02212 in R v3.3.0, using a 20x20 cell hexagonal, non-toroidal grid. Training was based on optimising Tanimoto distances between molecular (ECFP4) fingerprints. Various training schemes were tested, but the SOM presented here was based on training for 4,000 steps using the default training scheme of the Kohonen package. Further analysis and visualization of the map used custom R scripts.

Supplementary Results

1. Genomic diversity in parasitic nematodes and platyhelminths

1.1 Genome sequencing and assembly

Authors: Avril Coghlan, James Cotton, Kimberlie Hallsworth-Pepin, Nancy Holroyd, John Martin, Phillip Ozersky, Isheng Jason Tsai and Xu Zhang

Sequencing strategy

Many nematode and platyhelminth species are poorly studied, and in many cases no optimised protocols exist for obtaining sufficient high-quality samples for DNA extraction and sequencing. Several highly contiguous ‘reference-quality’ genomes were previously published, for which the assemblies underwent laborious manual finishing, including Schistosoma mansoni213, Trichuris muris53, Trichinella spiralis106, Strongyloides ratti44, and Onchocerca volvulus92. Using the N50 as a measure of assembly contiguity (the scaffold size for which 50% of bases are found in scaffolds of this size or longer), these assemblies have N50s in the range 0.4 Mb (T. muris) to 32.1 Mb (S. mansoni). In contrast, because manual finishing of assemblies is extremely labour-intensive and time-consuming, our approach for this project has been to produce draft genomes, and use the reference genomes to aid in gene-finding.

As part of the present study, 45 genomes were sequenced (36 at WTSI for this project, 6 by MGI and 3 by BaNG, Supplementary Tables 1 and 22, Supplementary Fig. 15) and these were combined with 36 published or publicly available genomes, enabling a comparative analysis of 81 nematode and platyhelminth species (Supplementary Table 3). The combined set of species spans each of the major clades of nematodes (clade I, III, IV, V) and flatworms (cestode, trematode), enabling clade-specific genomic changes (e.g. gene family expansions) to be identified.

Genome assembly pipeline validation

To assess the accuracy of the data production methods, the reference C. elegans N2 strain was included in the WTSI sequencing, assembly and annotation pipelines (Supplementary Tables 1 and 22). The resulting assembly was 101.2 Mb, similar in size to the reference genome assembly (100.3 Mb; WormBase 235). The CEGMA score91 was 98.3% for full-length matches (99.1% for partial matches), suggesting the assembly includes most conserved genes. Furthermore, the WTSI gene prediction pipeline (see below) produced 19,282 gene predictions (cf. 20,483 genes in WormBase WS238134). Taking the longest splice-form for each WormBase WS238 gene, 96% had a BLASTP102 match (with E<1e-5) to the gene set predicted on our de novo assembly, indicating that the assembly is virtually complete, at least with respect to coding DNA.

The C. elegans assembly produced using the automated pipeline did not contain obvious interchromosomal misassemblies involving large genomic regions, but did contain some interchromosomal misassemblies involving smaller chunks, as well as some intrachromosomal misassemblies. This level of misassembly is relatively low given that this is a draft genome (without long-range sequencing and mapping data) and that the primary focus of the present study is on gene content for each species.

Assembly statistics

Assembly size and GC content. For the 45 species sequenced for this project, the assembly sizes varied from 49.1 Mb (Trichinella nativa) to 834.6 Mb (Echinostoma caproni; Supplementary Table 1; Extended Data Fig. 1). These are within the range of assembly sizes previously published for nematodes and platyhelminths, from 19.7 Mb (Pratylenchus coffeae214) to 1258.7 Mb (Spirometra erinaceieuropaei215), although free-living platyhelminths such as Otomesostoma auditivum have been estimated to have genomes as large as ~18.4 Gb216. The GC contents of the 45 species are also within the range seen for published species, from 21.3% for Strongyloides ratti44 to 44.5% for Trichuris muris53 (Supplementary Table 1).

Genome coverage. 32 out of 45 species had corrected CEGMA scores >85% (see Methods: Assembly QC), suggesting their genome assemblies are relatively complete (Supplementary Table 1). In 22 cases the corrected CEGMA score was >95%, comparable to those published for many reference genome assemblies that have undergone manual improvement.

Assembly contiguity. Of the 45 species sequenced for this project, the N50 varied from 1.2 kb to 1.3 Mb (median 25.8 kb, interquartile range 10.1–49.9 kb; Supplementary Table 1). However, because genome sizes vary greatly across nematodes and platyhelminths, the ratio of the N50 to the number of scaffolds provides a better measure of assembly contiguity; for example, an assembly with an N50 of 10 kb and 1000 scaffolds is highly contiguous (N50/scaffold-count = 10), while an assembly with an N50 of 10 kb and 100,000 scaffolds (N50/scaffold-count = 0.1) is highly fragmented. Thirty of the 32 genomes with relatively high CEGMA partial scores (≥85%) had N50/scaffold-counts of >0.3. The 13 most fragmented assemblies (N50/scaffold-count < 0.3) had lower CEGMA scores, as expected, suggesting missing genes or that scaffolds are too fragmented for the CEGMA genes to be found (Supplementary Table 1).
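As an illustration of the two contiguity measures used here, the following Python sketch computes the N50 from a list of scaffold lengths and the N50/scaffold-count ratio. The function names and the toy scaffold lengths are hypothetical. The ratio is taken as N50 in base pairs divided by the number of scaffolds, which is the reading that reproduces the worked values in the text (10 kb and 1000 scaffolds giving 10).

```python
def n50(scaffold_lengths):
    """Scaffold length L such that scaffolds of length >= L hold at least half of all bases."""
    total = sum(scaffold_lengths)
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffolds given")


def n50_per_scaffold(n50_bp, scaffold_count):
    """N50 (in bp) divided by the number of scaffolds in the assembly."""
    return n50_bp / scaffold_count


# Toy assembly: total 290 bp; the two longest scaffolds (80 + 70) cover more than half.
toy_n50 = n50([80, 70, 50, 40, 30, 20])

# The two examples from the text, both with an N50 of 10 kb.
contiguous = n50_per_scaffold(10_000, 1_000)      # highly contiguous assembly
fragmented = n50_per_scaffold(10_000, 100_000)    # highly fragmented assembly
```

Dividing by the scaffold count penalises assemblies that reach a given N50 only because a few long scaffolds sit on top of a very large number of short ones.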

1.2 Gene-finding and annotation

Authors: Avril Coghlan, Kimberlie Hallsworth-Pepin, Nancy Holroyd, Kevin Howe, John Martin, Phillip Ozersky, Eleanor Stanley and Xu Zhang

Gene-finding

A total of 797,863 genes were predicted for the 45 species (Supplementary Fig. 16). The assemblies and gene sets are available via WormBase ParaSite135.

Gene count

Effect of assembly fragmentation on gene count. The number of genes per species varied from 9,831 to 37,906 across the 45 species that we sequenced (median 15,892; Supplementary Table 2; Extended Data Fig. 1). There was a negative correlation between gene count and assembly contiguity, as measured by N50/scaffold-count (Spearman’s ρ=-0.51, P=0.0005, n=45). In other words, species with fragmented assemblies tended to have higher gene counts, probably because many true genes are split into multiple (partial) gene models. In support of this, there was a strong negative correlation between the gene count and median protein length, suggesting that high gene counts are due to a higher abundance of split or partial genes (Spearman’s ρ=-0.82, P

Normalised gene counts. For the most contiguous assemblies, the observed gene count converges to ~9,000-16,000, suggesting the true gene count is in this range for most nematodes and platyhelminths, and that high gene counts of >25,000 are probably artefacts due to assembly fragmentation. To correct for the effect of assembly fragmentation on the gene count, gene counts were normalised for each of our 45 species, by dividing the total proteome length for a species by the mean protein length for C. elegans (409.82 amino acids). The normalised gene counts varied from 5,636 to 19,093 (Supplementary Table 2). The 5-95% percentile range of the re-estimated counts is 9,132-17,274, which agrees well with the 9,000-16,000 estimate for a typical nematode or platyhelminth. The range is, however, still lower than the 20,483 genes found in C. elegans (WS238). Some of the ~4,500 additional C. elegans genes are due to known gene family expansions in the Caenorhabditis lineage (for example of GPCRs217), but some may be hard-to-find genes that have only been identified as a result of extensive manual curation of the C. elegans genome.

Taking just the 33 tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7), there was no significant difference between the re-estimated gene counts in nematodes (5-95% percentile 10,704-21,142; n=24; median 12,426) and platyhelminths (11,215-16,988; n=9; median 12,515; Wilcoxon test: P=0.9). The four free-living nematodes (all tier 1 species) had higher re-estimated gene counts (median 19,973) than parasitic (tier 1) nematodes (median 12,224; Wilcoxon test: P=0.01), suggesting that either there has been gene loss from parasitic lineages or gene gain (e.g. from duplications or birth of novel genes) in free-living lineages. There was just one free-living platyhelminth amongst the 81 species (Schmidtea, not a tier 1 species), and its re-estimated gene count (17,096) was relatively high compared to other platyhelminths.
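The normalisation described above is a single division, sketched here in Python. The function name and the toy protein lengths are hypothetical; only the C. elegans mean protein length (409.82 amino acids) comes from the text.

```python
C_ELEGANS_MEAN_PROTEIN_LENGTH = 409.82  # amino acids, as given in the text


def normalised_gene_count(protein_lengths):
    """Total proteome length divided by the mean C. elegans protein length."""
    return sum(protein_lengths) / C_ELEGANS_MEAN_PROTEIN_LENGTH


# Toy proteome: two genes split into halves inflate the raw count from 4 to 6 models.
intact = [410, 410, 410, 409]
fragmented = [205, 205, 205, 205, 410, 409]

raw_count = len(fragmented)                          # 6 gene models
normalised = normalised_gene_count(fragmented)       # close to 4
```

Because splitting a gene model leaves the total proteome length unchanged, the normalised count is the same for the intact and fragmented toy proteomes, whereas the raw count is not.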

Assessing gene-finding accuracy using curated gene sets

The accuracy of the MAKER gene prediction pipeline used on 36 genomes at WTSI (Supplementary Fig. 16a) was assessed by using it on the C. elegans reference assembly (WormBase WS238), without support from alignments to existing C. elegans data. Compared to the highly curated C. elegans genes from WormBase, MAKER over-predicted the number of genes (23,902 vs 20,483), and 77% of (unique) WormBase coding exons were predicted with identical coordinates when all coding exons (CDS features) across all splice-forms were considered. The 77% exon sensitivity obtained using the present WTSI MAKER pipeline is lower than could be obtained using gene-finders specifically trained on C. elegans data218. It is, however, a good compromise given our requirement to annotate a wide range of nematode and platyhelminth genomes, without extensive training data for each one.

While accurate prediction of exon boundaries would be ideal, a more practical requirement for gene finding in draft genomes is that the majority of true amino acids are included in the predicted proteins. Even without RNAseq evidence, the WTSI MAKER pipeline predicted C. elegans genes with 91% sensitivity at the amino acid level (i.e. 91% of true codons were included with the same coordinates and reading frame in the predicted gene set). To assess the impact on protein annotation (e.g. GO terms, protein names), the Pfam domains from WormBase proteins were compared to those in our predicted C. elegans proteins. Of 3,822 unique domains (i.e. unique Pfam identifiers) from WormBase, 96% were found in one or more of our predicted C. elegans proteins, showing that MAKER captured most of the expected functional regions.

The accuracy of the WTSI MAKER pipeline was further assessed using the assemblies for two published species, S. mansoni213 and E. multilocularis219.
Compared to partially curated gene models from GeneDB108, the MAKER pipeline had exon-level sensitivities of 54% and 59% for S. mansoni and E. multilocularis, respectively. The lower exon-level sensitivity, compared with C. elegans, likely reflected the limited curation that these flatworm gene sets have received and the greater difficulty in predicting flatworm genes, which have particularly challenging ‘micro-exons’ and numerous long introns213,219. However, the amino-acid level sensitivities compared well to the sensitivity seen in C. elegans, with figures of 82% and 85% for S. mansoni and E. multilocularis, respectively.
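Exon-level sensitivity, as used in this section, is the fraction of unique reference coding exons that are reproduced with identical coordinates. The Python sketch below illustrates the calculation under the assumption that each exon is identified by a (sequence, start, end, strand) tuple. The function name and the toy coordinates are hypothetical and do not come from any of the gene sets discussed.

```python
def exon_sensitivity(reference_exons, predicted_exons):
    """Fraction of unique reference exons predicted with identical coordinates."""
    reference = set(reference_exons)   # de-duplicate exons shared between splice-forms
    predicted = set(predicted_exons)
    if not reference:
        raise ValueError("reference exon set is empty")
    return len(reference & predicted) / len(reference)


# Toy data: exons as (sequence, start, end, strand).
reference = [
    ("chrI", 100, 250, "+"),
    ("chrI", 400, 520, "+"),
    ("chrI", 400, 520, "+"),   # same exon seen again in a second splice-form
    ("chrI", 900, 1010, "+"),
    ("chrII", 50, 180, "-"),
]
predicted = [
    ("chrI", 100, 250, "+"),
    ("chrI", 400, 520, "+"),
    ("chrI", 905, 1010, "+"),  # boundary off by 5 bp, so not counted as a match
    ("chrII", 50, 180, "-"),
    ("chrII", 700, 800, "-"),  # extra prediction; does not affect sensitivity
]
sensitivity = exon_sensitivity(reference, predicted)
```

Requiring identical coordinates is a strict criterion: a prediction that misses an exon boundary by a few bases contributes nothing at the exon level, even though most of its codons may still be correct. This is why the amino-acid level figures in the text are higher than the exon-level ones.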

1.3 Variation in genome size

Authors: Avril Coghlan and James Cotton

Among the 81 nematode and platyhelminth species, there is a ~30-fold variation in genome size, from 43 Mb (Parastrongyloides trichosuri) to 1259 Mb (Spirometra erinaceieuropaei; Supplementary Table 1; Extended Data Fig. 1). Even between closely related species there is wide variation, for example 341–702 Mb just in the schistosomatid clade of flatworms. Considering the tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7), platyhelminth genomes are larger than those of nematodes (medians 365 and 100 Mb; n=9 and n=24; Wilcoxon test: P-value 0.007). Within nematodes, genomes of clade V species tend to be the largest; cestodes usually have similar genome sizes to nematodes (medians 139 Mb versus 100 Mb), but trematode genomes are larger (median 385 Mb; Wilcoxon test: P-value 1e-4; Supplementary Table 1). There was no difference between the genome sizes of free-living and parasitic (tier 1) nematode species (medians 83 and 102 Mb; Wilcoxon test: P-value 0.3).

Some possible reasons for large differences in genome size amongst the nematodes and platyhelminths are: whole (or partial) genome duplication; expansion of repetitive DNA such as transposable elements; increase in intron content; or expansion of gene families. We investigate each of these possibilities in turn below.

Evidence for recent genome duplication events

The nematode Meloidogyne incognita is hypothesised to be a polyploid that arose due to interspecific hybridization (i.e. an allopolyploid), and analyses of the M. incognita genome revealed many gene loci present in divergent copies220,221. Cytogenetic studies have shown that some other Meloidogyne species are diploid or triploid, with variation in ploidy even within species222. Within flatworms, polyploid strains of species of Fasciola, Paragonimus and Schmidtea have been observed, and in the case of Fasciola and Paragonimus may have arisen by interspecific hybridization223.

To investigate whether any of the sequenced helminth genomes reveal traces of recent genome duplication events, we examined the ‘low copy’ (lc) CEGMA genes that are present as single copy in almost all eukaryotes (CEGMA version 2.4105). The average number of copies for all of the lc CEGMA genes (average CEG number) should be close to 1.0 unless there has been a whole or partial genome duplication (polyploidisation), large-scale gene duplication has occurred (e.g. in a non-polyploid species), or allelic copies have been (incorrectly) assembled independently (e.g. due to high polymorphism). For example, Meloidogyne incognita (downloaded from WormBase release WS245) has an average CEG (partial) number of 1.70, and stands out as a statistical outlier among clade IV nematodes (Supplementary Table 1). This approach will only identify relatively recent genome duplication events, since many duplicates that arose from an ancient genome duplication will probably have been subsequently lost.

In clade III nematodes, Gongylonema pulchrum is an outlier, with an average CEG number of 1.53 (Supplementary Table 1). However, the assembly for this species is very fragmented (N50/scaffold-count = 0.03), and when shorter scaffolds of <20 kb were excluded, the average CEG number was 1.09, near the expected value of 1.0.
The same effect was observed for the flatworms Spirometra erinaceieuropaei (N50/scaffold-count = 0.01; CEG number 1.80; CEG number 1.08 for scaffolds of ≥20 kb) and H. taeniaeformis (N50/scaffold-count = 0.42; CEG number 1.48; CEG number 1.07 for scaffolds ≥20 kb). The high CEG number for these three species must be an artefact resulting from many small scaffolds carrying apparently duplicated genes, which in fact represent uncollapsed haplotypes. In contrast, the average CEG number showed little dependence on scaffold length for Meloidogyne incognita, whose genome was probably duplicated (CEG number 1.70; CEG number 1.57 for scaffolds of ≥20 kb).

Among clade V nematodes, Teladorsagia circumcincta is an outlier with respect to average CEG number (Supplementary Table 1), and average CEG number showed relatively little dependence on scaffold length (CEG number 1.88; CEG number 1.38 for scaffolds of ≥20 kb). The T. circumcincta genome was sequenced based on DNA from multiple individuals, so the high CEG number for this species is probably an artefact due to (incorrectly) uncollapsed haplotypes rather than a genome duplication.

Among the flatworms, Schmidtea mediterranea and Fasciola hepatica have unusually high CEG numbers (1.98 and 1.42; Supplementary Table 1), and these are not dependent on scaffold length (CEG numbers 1.57, 1.32 for scaffolds of ≥20 kb). As mentioned above, polyploid isolates have been observed in these genera223. However, the Schmidtea genome assembly used (v3.1) is thought to include many uncollapsed haplotypes. In the original F. hepatica genome publication224 the authors found no evidence of genome duplication based on CEG number and read coverage, although their assembly was for a different isolate than the one studied here. For a recent genome duplication, large duplicated blocks of genes would be expected. However, looking at paralogues that have arisen since divergence from Echinostoma caproni (the closest relative in our data set), no large scaffolds could be found with convincing evidence of genome duplication (i.e. linked by ≥5 pairs of paralogues). Thus, it seems more likely that the higher than expected CEG number reflects many small-scale gene duplications rather than a genome duplication event.
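The scaffold-length check used throughout this section can be sketched as follows. The Python code assumes a hypothetical input of (CEG identifier, scaffold identifier) pairs, one per copy found, plus a table of scaffold lengths. It averages over the CEGs found at least once after filtering, which is an assumption about how the average is defined. It is not the CEGMA program itself, and all identifiers and lengths are toy values.

```python
def average_ceg_number(hits, scaffold_lengths, min_scaffold_length=0):
    """Mean copy number per CEG, counting only copies on sufficiently long scaffolds.

    hits: iterable of (ceg_id, scaffold_id) pairs, one pair per copy found.
    scaffold_lengths: dict mapping scaffold_id to its length in bp.
    """
    copies = {}
    for ceg_id, scaffold_id in hits:
        if scaffold_lengths[scaffold_id] >= min_scaffold_length:
            copies[ceg_id] = copies.get(ceg_id, 0) + 1
    if not copies:
        raise ValueError("no CEG copies left after filtering")
    return sum(copies.values()) / len(copies)


# Toy assembly: two CEGs have an extra copy on a short scaffold.
lengths = {"big1": 50_000, "big2": 30_000, "small1": 5_000, "small2": 8_000}
hits = [
    ("KOG0001", "big1"), ("KOG0001", "small1"),
    ("KOG0002", "big1"), ("KOG0002", "small2"),
    ("KOG0003", "big2"),
]
all_scaffolds = average_ceg_number(hits, lengths)              # 5 copies / 3 CEGs
long_scaffolds = average_ceg_number(hits, lengths, 20_000)     # 3 copies / 3 CEGs
```

A drop towards 1.0 when short scaffolds are excluded, as in the toy data, is the pattern attributed in the text to uncollapsed haplotypes. A value that stays high on long scaffolds is more consistent with genuine duplication.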

Effect of repeat content on genome size

Finding repeats. There is huge variation in repeat content in published nematode and platyhelminth genomes, from <1% for Pratylenchus coffeae214 to nearly 50% in species such as Romanomermis culicivorax (48%225) and Schistosoma mansoni (45%213; Supplementary Table 5). To annotate transposable elements and other repeats in our assemblies, we built new repeat libraries for each species, and masked the assemblies using RepeatMasker. Of 36 published or publicly available genomes for nematodes and platyhelminths, the repeat content was previously published for 32 species, and despite differences in repeat library construction and assembly versions, for 24 of these 32, our repeat content values were within 5% of the published values (Supplementary Table 5). For three species the difference was greater than 10%: S. mansoni (published 40%, we found 50.2%), F. hepatica (published 55.3%; we found 67.9%), and C. sinensis (published 25.9%, we found 38.0%), perhaps because our strategy combined three repeat-finders (RepeatModeler, TransposonPSI, LTRHarvest). The 45 new assemblies contained 2.5–67.9% repeats, exceeding the maximum value of ~50% previously reported for nematodes and platyhelminths (Supplementary Table 5). Some of the variation in repeat content may be due to differences in sequencing and assembly approaches; for example, some assemblers discard repeats.

Effect of repeats on assembly size. Among our new assemblies, larger assemblies tended to be more repeat-rich; for example, the repeat-rich flatworms Trichobilharzia regenti (59.5%) and Schistocephalus solidus (53.8%) have large assemblies (702 and 539 Mb, respectively). However, it is possible that these assembly sizes underestimate the true genome size because repeats have been collapsed in the assemblies. To determine the extent to which this has happened, reads from short-insert sequencing libraries were remapped to the 36 WTSI assemblies. Based on read depth, the total amount of extra sequence contained within collapsed repeats was estimated and used to provide new estimates for genome size (see Assembly QC in Methods; Supplementary Table 5). For three species, >20% of the original assembly size would be gained by ‘uncollapsing’ those repeats: Ascaris lumbricoides (23.9%), Soboliphyme baturini (25.8%), and Schistocephalus solidus (30.2%). The A. lumbricoides (germline) genome contains many copies of a satellite repeat that are lost from the DNA of somatic cells by a process of chromatin diminution226, and these may have been collapsed in the assembly. The collapsed repeats in the clade I nematode S. baturini and in the flatworm S. solidus may consist of centromeric repeat, as seen in Trichuris muris53, since nematodes in clade I are monocentric, as are many flatworms227. Alternatively, the collapsed repeats in S. baturini (a gonochoristic species with an unknown karyotype) may be part of a , as seen in Trichuris muris53.

Effect of chromatin diminution on assembly. In the early development of clade IIIb nematodes (Ascaridomorpha), the genomes of their somatic cells are reduced by chromatin diminution39,226,228-230. In Ascaris suum, for example, a 123-bp satellite repeat, comprising about 13% (43 Mb) of the germline genome DNA, is eliminated from somatic cells229. In the present study, the A. lumbricoides DNA from testes (Supplementary Table 22) will have largely comprised germline cells (non-diminished genomes) and contained a large amount of collapsed repeat (see above; 76 Mb; Supplementary Table 5). The A. suum assembly analysed also corresponds to the germline genome229. In contrast, the P. equorum and T. canis genomes were sequenced from whole individuals and likely correspond to the somatic (diminished) genome. Thus, repeat content has probably been underestimated in species that undergo chromatin diminution.

Variation in repeat content.
We see huge variation in repeat content across the 81 nematodes and platyhelminths, from 3.8-54.5% (5-95% percentile) when the Ascaridomorpha and Schmidtea are excluded, since their repeat content estimates were probably inaccurate due to chromatin diminution or heterozygosity, respectively (Supplementary Table 5). Even allowing for re-estimated repeat content, within the nematodes, clades III and IV are relatively repeat-poor (90% are ≤20% repeat), while clades I and V are repeat-rich (90% are ≥19% repeat; Supplementary Table 5). The repeat content of cestodes (median 14.2%, range 3.2-53.8%) is similar to that of nematodes (median 18.1%, range 2.8-62.1% excluding Ascaridomorpha), but trematode genomes are far more repeat-rich (median 50.2%, range 38.0-67.9%).

Several species that are outliers with respect to genome size are also outliers with respect to repeat content: Gongylonema pulchrum in clade III, Heligmosomoides bakeri in clade V, and Fasciola hepatica among the trematodes (Supplementary Tables 1, 5). Massive expansions of particular transposable element families seem to be responsible for the very high repeat contents of these outliers. G. pulchrum has a far higher content of LTR repeats and DNA elements than other members of clade III, and H. bakeri has an unusually high amount of LINEs and DNA elements (Supplementary Table 5). F. hepatica does not have a high content of any known transposon class, but has a very high content of unknown (unclassified) repeats (84.5% of the genome), as do G. pulchrum, S. mediterranea and P. xenopodis (Supplementary Table 5). These are possibly transposable elements of unknown class. An analysis of a subset of the nematode genomes studied here concluded that large expansions of transposable element families in particular nematode lineages are due to genetic drift, rather than factors relating to life history or RNAi231.

Relationship between repeat content and genome size. There is a roughly linear relationship between repeat content and genome size for the 81 nematode and platyhelminth species, suggesting repeat content plays a major role in determining genome size. This is expected, as a positive correlation between repeat content and genome size in eukaryotes has been previously reported14. In our data, the correlation is high (Spearman’s ρ=0.96, P<1e-15), but the data points are not independent as many species are closely related. In the C. elegans genome, there is 83.9 Mb of non-repeat (non-repeat-masked) DNA, and one might imagine that other species have the same amount of non-repeat DNA, and that extra repeat might account for increases in genome sizes relative to C. elegans. However, we do not find this to be the case: the amount of non-repetitive DNA also increases with genome size (Spearman’s ρ=0.92, P<1e-15). Thus, large genomes may arise in species in which purifying selection is relaxed, in which both expansions of transposable elements and of non-repetitive DNA (e.g. exons or unique non-coding DNA) are tolerated.

Effect of coding DNA and intron content on genome size

Coding DNA content and genome size. There are several species that are outliers with respect to genome size but not repeat content, so other factors such as coding DNA content must play a role in their large genomes: Romanomermis culicivorax in clade I; Ascaris lumbricoides, A. suum, and Toxocara canis in clade III; Globodera pallida in clade IV; Teladorsagia circumcincta in clade V; and Spirometra erinaceieuropaei, Schistocephalus solidus and Dibothriocephalus latus among cestodes (Supplementary Tables 1 and 5). As mentioned above, the repeat content and genome size estimates for the Ascaridomorpha are probably inaccurate due to chromatin diminution. Of the others, R. culicivorax, S. erinaceieuropaei, and S. solidus are all outliers with respect to coding DNA content in their respective clades (Supplementary Table 2). All of these have unusually high gene counts compared to other members of their clade (after correcting for assembly fragmentation; Supplementary Table 2). Their high gene count seems to be partly due to gene duplications, as the number of paralogues is particularly high in those species (Supplementary Table 2).

Our CEGMA analysis (see Evidence for genome duplications above) showed that although S. erinaceieuropaei has a high CEG number, it appears to be an artefact resulting from many small scaffolds carrying apparently duplicated genes, which are actually uncollapsed haplotypes. Like S. erinaceieuropaei, R. culicivorax and S. solidus both have a relatively high level of assembly fragmentation (all have N50/scaffold-count ≤0.6), so they possibly also suffer from this issue. However, R. culicivorax has a greater number of gene families that contain C. elegans members compared to other clade I nematodes (4,728 in Romanomermis versus 3,685-4,155 in other clade I species), so it may have a larger gene set because it has retained ancestral nematode genes that other clade I species have lost.
Intron content and genome size. An unusually large amount of intron DNA compared to other members of their clades seems to have contributed to the large genome sizes of Romanomermis culicivorax, Ascaris lumbricoides, A. suum, Toxocara canis, Teladorsagia circumcincta and Schistocephalus solidus (Supplementary Table 2). This may partly be due to an unusually high total number of introns in Romanomermis culicivorax, Ascaris lumbricoides and A. suum compared to other members of the same clade, probably due to their high gene counts (Supplementary Table 2). High gene counts may be due to gene loss from other clade I species in the case of R. culicivorax (see above), and due to diminution in other Ascaridomorpha. These species also have unusually long median intron sizes compared to other members of the same clades (medians 1,345 bp, 979 bp and 1,031 bp, respectively; Supplementary Table 2). In addition, T. canis has unusually long introns (median 738 bp), as do Teladorsagia circumcincta and Schistocephalus solidus (medians 340 bp and 1,264 bp). There could be various reasons for their high median intron size: expansion in size of existing introns (whose positions in genes are conserved with other species), or gain of novel long introns, or loss of short introns by these species; or alternatively, shrinking of conserved introns, gain of short introns or loss of long introns by other species.

Contribution of repeats, genes, introns and intergenic DNA to genome size variation

Multiple regression model of genome size. As many of the factors that are correlated with genome size are themselves correlated with each other, we built a statistical model to identify the most important factors driving genome size variation in these data. As an initial (complete) model, we began with a set of covariates that seemed likely to be independent of each other, including variates designed to capture variation in gene content (number of genes, total exon length and total intron length), the degree of gene duplication (number of paralogues), the quality of assembly (N50n: the number of scaffolds whose summed length is N50) and the repeat content (total SINE length, total LINE length, total length of LTR elements, total length of DNA elements, total length of small RNA genes, total satellite repeat length, total simple repeat length and total length of low-complexity sequence). GC content would not be expected to affect genome size and was included as a control. Stepwise regression was used to identify a minimal model that best explained the data, by adding or removing at each step the term that had the largest effect on the Akaike Information Criterion. The resulting model contained eight terms including the intercept, all of which were significant at P<0.01, except for the total length of small RNA genes. The latter was subsequently dropped to leave a minimum adequate model of six terms (in order of increasing P-value: total LTR length, total length of simple repeats, N50n, total length of DNA elements, total length of introns, total length of low complexity sequence) plus a significant intercept term.

Bayesian mixed-effect model of genome size. A complication with this model is that the species vary in their degree of relatedness; for example, Strongyloides is represented by multiple species but species such as Romanomermis, Dracunculus and Protopolystoma are the only exemplars of large parasite groups and are only distantly related to other species. This effectively leads to non-independence of the species. To address this, a phylogenetic mixed-effect model was built using the R package MCMCglmm141, in which the species tree was used to specify the expected covariance between species and so correct for this non-independence. Including the phylogenetic effect showed similar results to the multiple regression model, with the phylogenetic component only explaining approximately 9% of the variance in genome size, although this estimate had very wide confidence intervals (0.02%-49%). Supplementary Table 6 shows the coefficient estimates, and MCMC P-values for tests that the coefficient estimates are significantly different from zero. These models confirm that repeat content, in particular the content of some transposable elements (DNA elements and LTR-containing elements) and of simple repeats, explains much of the variation in genome size. There is also an effect of assembly quality (N50n), but the number and length of coding regions does not appear to play a significant role. Surprisingly, a high content of low complexity DNA is significantly associated with smaller genome size; clade III and IV nematodes have a relatively high fraction of low complexity DNA (although still only 2-3% of the genome; Supplementary Table 5) but relatively small genomes (Supplementary Table 1). Genome size variation in parasitic nematodes and platyhelminths appears to be largely due to changes in a number of different non-coding elements, suggesting it is either non-adaptive or responding to selection only at the level of overall genome size.
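
The stepwise selection described for the multiple regression model can be sketched as follows. This is a minimal pure-Python illustration, assuming ordinary least squares with the Gaussian AIC written as n·ln(RSS/n) + 2k, where k counts the fitted coefficients including the intercept. It is not the authors' implementation; the covariate names and all values are hypothetical.

```python
import math

def ols_rss(rows, y, cols):
    """Residual sum of squares for y ~ intercept + the covariates indexed by cols."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(p):
            if r != i:
                factor = A[r][i] / A[i][i]
                A[r] = [a - factor * b for a, b in zip(A[r], A[i])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((t - sum(b * v for b, v in zip(beta, x))) ** 2 for x, t in zip(X, y))

def aic(rows, y, cols):
    """Gaussian AIC up to an additive constant."""
    n = len(y)
    rss = max(ols_rss(rows, y, cols), 1e-12)  # guard against log(0) on a perfect fit
    return n * math.log(rss / n) + 2 * (len(cols) + 1)

def stepwise_aic(rows, y, names):
    """Start from the complete model; add or drop one term per step while AIC falls."""
    current = set(range(len(names)))
    best = aic(rows, y, sorted(current))
    while True:
        # Toggling a term drops it if present and adds it if absent.
        trials = [sorted(current ^ {c}) for c in range(len(names))]
        score, cols = min((aic(rows, y, t), t) for t in trials)
        if score >= best:
            return [names[c] for c in sorted(current)], best
        current, best = set(cols), score

# Hypothetical covariates for ten species: LTR length (Mb), intron length (Mb), GC%.
names = ["ltr_mb", "intron_mb", "gc_percent"]
ltr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
intron = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
gc = [40, 38, 42, 35, 37, 41, 39, 36, 43, 34]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.0]
rows = [list(map(float, r)) for r in zip(ltr, intron, gc)]
genome_mb = [20 + 5 * a + 2 * b + e for a, b, e in zip(ltr, intron, noise)]

selected, score = stepwise_aic(rows, genome_mb, names)
```

In this toy data set genome size depends on the LTR and intron covariates only, so both are retained. Whether the GC control is dropped depends on its chance correlation with the noise term. The sketch does not correct for phylogenetic non-independence.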

1.4 Mitochondrial Genome Evolution Author: Taisei Kikuchi

Genome size and genes. We assembled the mitochondrial genome sequences for 50 species (including nine species whose nuclear genomes were previously published) and analysed them with the mitogenomes of 25 species from previous studies (Supplementary Table 24a; Supplementary Methods). For four of our newly sequenced species we were not able to assemble mitogenomes, nor were we able to assemble or obtain mitogenomes for two published species, so we analysed mitogenomes for 75 of the 81 nematodes and platyhelminths. Most of the 75 mitochondrial genomes are compact (~13-17 kb), containing 12 or 13 protein-coding, 2 rRNA and 21-24 tRNA genes (Supplementary Figs. 20, 21, 22). One exception is the mitogenome of Romanomermis culicivorax, the most diverged member of clade I (Fig. 1), which is 26,194 bp and contains 14 protein-coding (due to a duplication of the ND3 gene) and 32 tRNA genes. Some mitogenomes, including those of Litomosoides sigmodontis and Syphacia muris, contain a smaller number of predicted tRNA genes (17 and 20, respectively), although this could be due to assembly incompleteness. As previously reported44, some nematode mitochondrial genomes are fragmented: in Rhabditophanes sp. KR3021 the genome consists of two circular molecules that together encode a full set of proteins, tRNAs and rRNAs, and in Globodera species multiple circles encode the full complement but also pseudogenised fragments146. Cestodes, trematodes and nematodes lack ATP8 genes, except clade I nematodes232; ATP8 genes have previously been found to be absent or highly modified in several metazoan taxa including Chaetognatha, Rotifera, and most bivalve molluscs233.
GC contents and GC skew. The mitochondrial genomes are AT-rich, ranging from 15.9% G+C in Bursaphelenchus xylophilus to 39.9% in . 
Generally, non-coding regions have lower GC content than coding regions, except in Clonorchis sinensis, Dirofilaria immitis, Romanomermis culicivorax and Fasciola hepatica. The lowest GC content, in B. xylophilus, is due to the presence of long non-coding regions that have very low GC content (1.1% in 2,118 bp). Thus, while most metazoan mitogenomes contain a long non-coding ‘control region’234, that of B. xylophilus seems to be particularly long.
Phylogeny. Three phylogenetic trees based on concatenated alignments of 12 protein-coding genes were constructed for nematodes, trematodes and cestodes, respectively (Supplementary Figs. 20, 21, 22; Supplementary Methods). Cestode and trematode mitogenome trees show topologies similar to those based on nuclear rRNA genes (18S or 28S)235,236, with small differences that are probably partly due to differing species sets. The nematode mitochondrial tree is not consistent with the 18S phylogeny, particularly for clade III nematodes, which cluster into two groups in the mitochondrial tree, as previously reported237 (Supplementary Fig. 20). However, it is known that mitochondrial rearrangements (which are common in clade III; see below) give rise to changes in sequence evolution238, making inference of the species tree based on mitogenomes extremely difficult239.
Gene order. In animals, mitochondrial gene order generally remains stable over time240. In nematodes and platyhelminths, large changes in gene order seem to have occurred as punctate events a few times across the species tree. In cestodes and trematodes, gene order is relatively conserved (even between these two classes), with only two big rearrangements observed in Schistosoma species (Supplementary Figs. 21, 22). On the other hand, we observed a wide range of gene orders in nematodes (Supplementary Fig. 20). Gene orders are relatively well conserved in clade I and V nematodes; however, rearrangements were observed in clades IVa and IVb, as previously reported for Strongyloides44. Clade III gene orders could be categorized into three patterns237, for Oxyuridomorpha (clade IIIa), Spiruromorpha+other (IIIc) (except Dracunculus), and Ascaridomorpha (IIIb) respectively. Although Dracunculus is sister to the Spiruromorpha241, it has a gene order similar to the Ascaridomorpha, clade V and some members of clade IV (sharing these orders: COX2_l-rRNA_ND3_ND5_ND6_ND4L_s-rRNA, CYTB_COX3_ND4 and ND1_ATP6), which may represent the ancestral order for nematode clades III, IV and V (Supplementary Fig. 20). This suggests that the rearrangements in the mitogenomes of Spiruromorpha occurred after divergence from Dracunculus. While the Spiruromorpha have undergone a small number of rearrangements, there appear to have been multiple, complex rearrangements in clade IVa44 and in Globodera species in clade IVb146.

2. Novel gene families and gene family evolution

2.1 Compara database of gene families Authors: Avril Coghlan, James Cotton, Bhavana Harsha and Eleanor Stanley

Overview of the Compara database

Due to splits or partial gene predictions in the more fragmented assemblies, high observed gene counts may not represent true increases in gene number. To get a more accurate picture of which gene families have expanded in different clades, a database of gene families was constructed using the Ensembl Compara pipeline15 based on sequence similarity. The database included 91 species: 56 nematodes (four free-living), 25 platyhelminths (one free-living), and 10 outgroup species from seven further animal phyla (Porifera, Annelida, Chordata, Cnidaria, , Arthropoda, Placozoa; Supplementary Table 4). From the combined dataset of 1,639,427 genes (91 species), 109,571 families were created, comprising 1,365,243 genes. A phylogenetic tree was built for each family and orthologs and paralogs were identified. From this list, 1,220 gene families were identified that contain genes with Pfam domains that are associated with transposable elements, and these were excluded from further analysis, leaving 108,351 families containing 1,345,329 genes (Supplementary Data File 1, Supplementary Table 7). Most families (66%) have 2-5 genes while 2,107 families have >100 genes, the largest family containing 1,219 genes.
Nematode and platyhelminth families that are missing from model organisms. Among the 108,351 families, there are 33,593 families that have ≥2 nematode species but lack C. elegans and 20,304 families that have ≥2 flatworm species but lack S. mansoni, indicating that there is a considerable breadth of nematode and platyhelminth gene functions that are absent from these model organisms.

Phylogenetic pattern of gene family presence and absence

As expected, the pattern of gene family sharing between the 81 nematodes and platyhelminths largely reflects the phylogeny of these organisms. Particularly striking is the very large number of families specific to one or a few species (Supplementary Fig. 1a): these are almost entirely small families with just one or a few genes per species (Supplementary Fig. 1b). The majority of families (76%) include ≤5 species, and most of these families with ≤5 species (86%) have ≤5 genes in total. Larger families are shared between the highest-level taxonomic groups (Supplementary Fig. 1a): for example 1,665 gene families are shared by clade III and V nematodes but missing from clade IV, consistent with our species tree (Fig. 1). Other families (1,545) are shared by clades III, IV and V, and many families are shared by all nematodes and some outgroup species (1,556), but far fewer families (508) are nematode-specific and shared by the four nematode clades, confirming the very different gene content between the clade I nematodes and other nematode groups included here. There are also many gene families specific to Neodermata (Protopolystoma, trematodes and cestodes; 965) and there are many more (1,329) exclusive to just the endoparasitic groups (trematodes and cestodes). While this is intriguing biologically, it does not appear to reflect the arrangement proposed in our species tree, where Protopolystoma is placed as sister to the trematodes, as rather fewer (725 families) are specific to trematodes+Protopolystoma. Strikingly, few gene families (just 233) are unique to and shared by all the flatworms here, and the free-living Schmidtea has more families shared with outgroup species than with other platyhelminths. To quantify the degree to which gene family content reflects the phylogeny, we constructed a maximum-likelihood phylogeny based on gene family presence and absence for families that are not shared by all 81 species. 
This ‘family sharing’ phylogeny supports 62 out of 89 internal nodes in the amino acid alignment species tree (see Species tree below). In particular, these data agree with the controversial relationship between nematode clades III, IV and V that is hard to resolve in the amino acid phylogeny. Disagreements between the two datasets are mostly on relationships within and between our outgroups, and within closely-related groups of species such as the genera Trichuris, Hymenolepis and Schistosoma (Extended Data Fig. 2). These disagreements are presumably due to the low resolution of family content as a phylogenetic marker. Other areas of conflict are more striking, for example over deep relationships within clade III which are also poorly supported in our amino acid species tree.
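
The input to such a ‘family sharing’ phylogeny can be pictured as a binary presence/absence matrix. The Python sketch below shows one way to build it and to drop families shared by every species, as described above. The family and species identifiers are hypothetical, and the tree-building step itself (done with maximum-likelihood software in the study) is not reproduced.

```python
def presence_absence(families, species):
    """Map each family id to a tuple of 0/1 flags, one per species, in the given order."""
    return {fam: tuple(int(sp in members) for sp in species)
            for fam, members in families.items()}

def drop_universal(matrix):
    """Remove families present in every species, which are excluded from the analysis."""
    return {fam: row for fam, row in matrix.items() if not all(row)}

def binary_alignment(matrix, species):
    """Return one 0/1 character string per species, with families in sorted id order."""
    fams = sorted(matrix)
    return {sp: "".join(str(matrix[f][i]) for f in fams)
            for i, sp in enumerate(species)}

# Hypothetical toy input: family id -> set of species with at least one member.
species = ["sp_A", "sp_B", "sp_C"]
families = {
    "fam1": {"sp_A", "sp_B", "sp_C"},  # universal, so it is dropped
    "fam2": {"sp_A", "sp_B"},
    "fam3": {"sp_C"},
}
matrix = drop_universal(presence_absence(families, species))
alignment = binary_alignment(matrix, species)
```

Each string in `alignment` is a row of binary characters that a phylogenetics package accepting binary data could take as input.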

The Compara protein-clustering was represented as a network, with species represented as nodes and the edges between two nodes weighted by the number of times proteins of both species appear together in a cluster. The applied layout successfully recovered broad phylogenetic groups within nematodes and platyhelminths, and positioned outgroups between nodes of both phyla (Supplementary Fig. 17). The edge with the highest weight connects the proteomes of A. suum and A. lumbricoides, which is expected since they are very closely related species, followed by edges connecting several Schistosoma species nodes.
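
A minimal Python sketch of this edge weighting is given below. It assumes that a pair of species scores one count for every cluster in which both have at least one protein; the text could also be read as counting protein pairs, so this interpretation is an assumption. The clusters and protein identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations

def species_network(clusters):
    """Return edge weights keyed by (species_a, species_b), with names in sorted order.

    clusters: iterable of clusters, each a list of (species, protein_id) pairs.
    """
    weights = Counter()
    for cluster in clusters:
        species_present = sorted({sp for sp, _ in cluster})
        # Every pair of species co-occurring in this cluster gains one count.
        for a, b in combinations(species_present, 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical clusters of (species, protein id) pairs.
clusters = [
    [("A. suum", "p1"), ("A. lumbricoides", "p2")],
    [("A. suum", "p3"), ("A. lumbricoides", "p4"), ("S. mansoni", "p5")],
    [("S. mansoni", "p6")],
]
weights = species_network(clusters)
heaviest_edge, heaviest_weight = weights.most_common(1)[0]
```

The resulting weighted edge list could then be passed to any graph layout tool; the layout algorithm used for Supplementary Fig. 17 is not specified in this section.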

2.2 Synapomorphies Authors: Mark Blaxter and Dominik Laetsch

We analysed the set of Compara gene families for families which constituted synapomorphies at key nodes relevant to within the phylogenetic tree of Metazoa (Supplementary Information: Methods 9). In total, 3,512 synapomorphic Compara families displayed a taxon coverage of 100% (i.e. ‘complete’ synapomorphies), of which 16.9% could be assigned a representative functional (InterPro) domain annotation (Supplementary Table 8). Allowing for stochastic absence of species in gene families (defining 90–100% taxon coverage as ‘stochastic absent’ synapomorphies), 2,369 additional synapomorphic families could be identified, which displayed a higher percentage (24.0%) of representative functional domain annotation. The low percentage of representative functional domain annotations arises from the fact that the sequences in these families are highly diverged and in many cases do not contain members from well-studied organisms. However, it is striking that many phylogenetically conserved gene families at deep nodes have escaped functional annotation. For the two phyla, nematodes and platyhelminths, we found 124 and 142 synapomorphic families, respectively, of which roughly half (51.6% for nematodes and 64.8% for platyhelminths) received a representative functional annotation (Supplementary Table 8). Families with complete taxon coverage across phyla were less frequent in nematodes (20) than in platyhelminths (71). Between the two phyla, clear differences between nematodes and platyhelminths in functional annotation of synapomorphies could be observed for those related to cellular structure such as cell adhesion, membrane and (Fig. 2). Nodes within platyhelminths exhibited higher numbers of synapomorphies related to cellular structure such as cell adhesion (46 synapomorphies), membrane (28), and (13) than within nematodes (3, 21, and 4, respectively). 
Furthermore, within platyhelminths the ‘births’ of these families seemed to have occurred in a more gradual and balanced manner along both major lineages, and . In nematodes, innovations related to membrane structure seemed to have occurred primarily at the last common ancestor of all nematodes, with piecemeal emergence of novelty along nodes of clade I, clade IVa, and clade V (Fig. 2). Within Neodermata (Protopolystoma, trematodes and cestodes), a clade-specific inositol-pentakisphosphate 2-kinase (EC 2.7.1.158, family 922204) was identified which might be involved in storage of phosphates in the form of calcium-hexakisphosphate, as has been shown for the laminated layer of larval Echinococcus19.
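
The taxon-coverage thresholds used in this section (100% for ‘complete’ and 90–100% for ‘stochastic absent’ synapomorphies) can be illustrated with the short Python sketch below. It assumes that a synapomorphic family has no members outside the clade defined by the node; the full criteria are given in Methods 9 and are not reproduced here, so the function and the species labels are hypothetical.

```python
def classify_synapomorphy(family_species, clade_species, min_coverage=0.9):
    """Classify a gene family relative to one node of the species tree.

    family_species: species with at least one gene in the family.
    clade_species:  all species descending from the node.
    """
    members = set(family_species)
    clade = set(clade_species)
    if not members or members - clade:
        # Empty family, or family also found outside the clade (assumed to disqualify it).
        return "not synapomorphic"
    coverage = len(members) / len(clade)
    if coverage == 1.0:
        return "complete"
    if coverage >= min_coverage:
        return "stochastic absent"
    return "not synapomorphic"

# Hypothetical clade of ten species.
clade = [f"sp{i}" for i in range(10)]
full = classify_synapomorphy(clade, clade)
nine_of_ten = classify_synapomorphy(clade[:9], clade)
eight_of_ten = classify_synapomorphy(clade[:8], clade)
with_outsider = classify_synapomorphy(clade + ["outgroup1"], clade)
```

With ten species in the clade, a family missing from one species still reaches the 90% threshold, whereas a family missing from two does not.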

Amongst nematodes, both clade III (one synapomorphy) and clade IV (five synapomorphies) displayed independent synapomorphies related to fatty acid retinoid (FAR) binding, distinct from the Compara family containing Gp-FAR-1 described from Globodera pallida27. FARs have been previously suggested as vaccine targets242, since many parasitic nematodes rely on importing essential lipids from their host. Ferrochelatase is an enzyme that catalyses the terminal step of haem biosynthesis. Nematodes encode ferrochelatase-like (FeCL) proteins that are devoid of the active site, while certain onchocercid nematodes (B. malayi, A. viteae, D. immitis, and O. volvulus)17 and S. venezuelensis18 harbour a second, apparently functional, copy of ferrochelatase acquired through horizontal gene transfer (HGT) from Alphaproteobacteria. The Compara family containing these functional HGT-ferrochelatases (family 787620) included taxa from clade III (with the exception of A. simplex, A. suum, A. lumbricoides, and P. equorum) and IV (with the exception of the three plant-parasitic species). This family also included two platyhelminths (S. mediterranea and T. regenti), two arthropods (D. melanogaster and I. scapularis), the sponge (A. queenslandica) and T. adhaerens. Phylogenetic analysis of the ferrochelatase Compara families 787620, 1184543, 850580, 740872, and bacterial ferrochelatases retrieved from NCBI (Supplementary Information: Methods 10) placed the additional, non-nematode ferrochelatases (with the exception of S. mediterranea) together with those of family 850580 (composed solely of platyhelminth and outgroup proteins) (Supplementary Fig. 2), suggesting clustering of these with the HGT-ferrochelatases was simply an artefact of the Compara analysis. Sequence similarity searches of the scaffold from which the S. mediterranea protein originated suggested it was a contaminant from an Alphaproteobacterium. 
The phylogenetic analysis also suggested that the acquisition of functional ferrochelatase by HGT predates the split of clade III and IV and that it was subsequently lost in clade V (Supplementary Fig. 2; Fig. 1). We also found a synapomorphic family (family 1184543) of clade IVa, annotated as ferrochelatase but lacking the active site. Sequences from this family were placed as sister to nematode FeCL orthologues (Supplementary Fig. 2). Currently, the functional role of these proteins is unknown17.

2.3 Gene family expansions Authors: Avril Coghlan, Matthew Berriman, James Cotton, Stephen Doyle, Nancy Holroyd, Adam Reid, Diogo Ribeiro, Gabriel Rinaldi, Bruce A. Rosa and Rahul Tyagi

Three separate scoring metrics and two clade definitions were used to identify gene families that are specific to, or expanded in, different clades and that therefore may underlie clade-specific traits. From a starting list of 10,986 families (each with ≥10 genes from tier 1 species, which are those with high-quality assemblies; Supplementary Information: Methods 7), 1,248 high families were found (Supplementary Table 9). Based on Pfam137 domain content, BLAST matches, and distribution across the species tree (since transposable element genes often have an unusual sporadic distribution), 58 of these 1,248 families were classified as likely transposable elements (Supplementary Table 9b). Four of these contained C. elegans genes with viral or transposable element origins. We discarded the 58 transposable element families from our analysis, as well as an additional 30 families in which the gene count was as high (or nearly so) in outgroups as in nematodes and platyhelminths (Supplementary Table 9b). We also identified 165 families that were expanded in free-living nematode and platyhelminth species (Supplementary Table 9b).

Gene families expanded in parasites

Previously reported expansions. The remaining 995 families had expanded in parasitic nematodes and platyhelminths. These included 25 gene family expansions previously reported, 21 of which have proven or hypothesised links to parasitism (Supplementary Table 9a).

Kinase, protease, protease inhibitor, SCP/TAPS and GPCR expansions. Striking or moderately large expansions in parasitic nematodes and platyhelminths were found in 51 families annotated as kinases (15), proteases (8), protease inhibitors (8), SCP/TAPS (8), and GPCRs (12) (Supplementary Table 9a). Genes in these key functional classes may have important roles in parasitic life styles and are investigated in further detail in sections below.

Other gene family expansions. For the remaining 919 families, the distributions of gene counts per species were manually inspected across the species tree. We identified 168 families (described below) with convincing expansions across many parasitic nematode and platyhelminth species, or with large expansions present in only a couple of them. Since assembly and gene-finding artefacts can create apparent species-specific expansions, single-species expansions of <20 genes were ignored in most cases, except for those in clades (e.g. nematode clades I and IIIc) where fewer expansions were seen overall. Of the convincing expansions, 27 were particularly striking (Supplementary Table 9a).

Expansions of gene families involved in host-parasite interactions

Immunomodulation, immune evasion, and establishment in the host. In strongylid nematodes, there is a striking expansion of an apyrase (EC 3.6.1.5) gene family that includes C. elegans apy-1 (family 135728; 34% predicted secreted; 15-24 copies per species in Ancylostoma; 8-10 in Haemonchus; Extended Data Fig. 3a and Supplementary Table 9a). Release of ATP by host cells is a ‘Danger-Associated Molecular Pattern’ (DAMP), signalling cell or tissue injury. An apyrase related to apy-1 is secreted by some haematophagous and dampens this host ‘danger’ signal by hydrolysing host ATP243,244. Nisbet et al.245 suggested that a Teladorsagia member of this family, which is present in excretory/secretory products of L4 larvae (as is a Heligmosomoides member246) and hydrolyses ADP/ATP, may play a similar immunomodulatory role. In addition, since many strongylids are blood feeders (Supplementary Table 12), apyrase may prevent purinergic activation of platelets, similar to its role in haematophagous insects, thus preventing clotting and permitting blood feeding. Interestingly, one clade of this Compara family has a novel Pfam domain combination, with both apyrase (Pfam:PF06079) and flavin-containing amine oxidoreductase (Pfam:PF01593, -like) domains (e.g. H. contortus HCOI00605900, N. americanus NECAME_12085, A. caninum ANCCAN_11441). Because proteins with the amine oxidoreductase domain likely break down amines247, strongylid genes with this domain combination may modulate the levels of host amines, such as , another ‘danger’ signal indicating tissue damage23,248. We also find an independent smaller expansion of another apyrase gene family, containing human genes ENTPD1, ENTPD2, ENTPD3, ENTPD8, in the cestode Taenia (family 511981; 32% predicted secreted; 9-12 copies per Taenia species; 1-2 per Hymenolepis species; Extended Data Fig. 3b and Supplementary Table 9a). It was previously shown that the tegumental membrane of cysticerci has an apyrase, which has been hypothesised to degrade nucleotides in the bloodstream of the host in order to inhibit platelet aggregation and inflammatory responses249. Another way that parasites may avoid or manipulate the host is to interact with immunoglobulins (). In S. mansoni and S. japonicum, respectively, Sm23 and Sj23 are surface-exposed proteins that bind the Fc domain of human IgG antibodies, which may help flatworm parasites to evade host immune recognition or complement activation25,250. A similar ‘non-immune’ (Fab-domain independent) binding of host IgG has been seen in bacterial such as streptococci251. Based on analysis of the S. mansoni genome, tetraspanin expansions have been described252, and in cestodes we see an expanded tetraspanin family containing human CD63 (a subfamily of all tetraspanins; family 306209; 9-12 copies per Echinococcus species, 6-16 in Hymenolepis, 1-2 in Schistosoma; Extended Data Fig. 3c and Supplementary Table 9a). The many tetraspanin subfamilies of animals have been placed by Compara into separate Compara families, due to their divergence, but the Compara family that is expanded in cestodes also contains the Fc-binding S. mansoni Sm23 and S. japonicum Sj23 proteins. 
We also see an expansion of a second, divergent tetraspanin Compara family in trematodes (family 730503; ~3-5 copies per Schistosoma species; 11 Clonorchis sinensis; 3 Fasciola hepatica; Extended Data Fig. 3d and Supplementary Table 9a). Since lipids act as effectors and second messengers in the vertebrate 253, interactions with host fatty acids and retinoids may have immunomodulatory effects. In strongylids (clades Va, Vb, Vc) we saw an expansion of the fatty acid- and retinoid-binding protein (FAR) family (family 179875; 75% predicted secreted; 14-18 copies per Ancylostoma species; 8-10 Haemonchus, compared to 6 C. elegans, 2 Globodera pallida; Extended Data Fig. 3e and Supplementary Table 9a) that has previously been studied in plant parasites27. FAR proteins are secreted by adult Ancylostoma254 as well as filaria255, and FAR genes are highly transcribed in adult Haemonchus and Necator28,256. Their role is unclear, but this Compara family also includes the Globodera gene gp-far-1 (also called sec-2257), encoding a surface protein that binds two lipids (linolenic and linoleic acids) that are precursors of plant defence compounds and the jasmonic acid defence signalling pathway27. Thus, secreted strongylid members of this family may play a role in eluding vertebrate host defences; although other possible roles include acquisition of lipids from the hosts, or delivering or sequestering signalling lipids and so modifying host tissues in which the adult parasites live28. Platyhelminths and nematodes have large numbers of glycosyl hydrolases related to degradation of glycosaminoglycans, including GH20, GH27, GH56, GH84 and GH85 class enzymes (Supplementary Fig. 23; Supplementary Table 27). These enzymes degrade hyaluronic acid, chondroitin and N-acetyl glucosamine, which give structural integrity to animal tissues258-260. A chondroitin hydrolase (GH56; EC 3.2.1.35) family is expanded in nematode clade Vc (Ancylostomatidae + ), with relatively high counts in other gastrointestinal parasites (family 117102; 48% predicted secreted; 15-21 copies in Ancylostoma species, 8 , 6 Strongyloides stercoralis; Extended Data Fig. 3f; Supplementary Fig. 23 and Supplementary Table 9a). This family includes C. elegans chhy-1, a chondroitin hydrolase which is orthologous to human HYAL1 and other , and which cleaves chondroitin and chondroitin sulphates261. These enzymes may play a role in breakdown of host intestinal walls by the parasites. We also observed a small expansion of an uncharacterised GH5-class glycosyl hydrolase in schistosomatids (family 627564; 56% predicted secreted; 7 S. mansoni; Extended Data Fig. 3g; Supplementary Fig. 23 and Supplementary Table 9a). The glycosyl hydrolase GH5 class includes a variety of specificities; some (e.g. β-glucosidase) have previously been reported in the excretory-secretory products of trematodes (e.g. Fasciola), and may be involved in degrading supramucosal gels found on host epithelial surfaces262. This Compara family (627564) includes a S. haematobium gene (A_02570) and a S. mansoni gene (Smp_187070) whose expression is highly enriched in eggs31,32. An intriguing possibility is that this family is used by schistosomatid eggs to traverse the bladder or intestinal walls, depending on schistosome species, in order to be passed into the environment. Alternatively, these transcripts may be present in eggs, but translated at a subsequent stage of the life cycle, for example for miracidial remodelling or penetrating the snail. Some helminth parasites such as Trichinella and Trichuris can live as intracellular parasites within host cells263,264. In clade I, which includes these species, there has been an expansion of a family with the PAN/Apple domain, involved in carbohydrate/protein binding (family 211141; 64% predicted secreted; 5 T. spiralis, 4 T.
muris, 6 Romanomermis; 2 Soboliphyme, compared to 2 C. elegans, 2 B. malayi, 2 S. ratti, 3 A. suum; Extended Data Fig. 3h and Supplementary Table 9a). Genes with the PAN/Apple domain are upregulated in the T. muris parasite larva compared to the adult53. One possibility is that these proteins play a role in host cell attachment and invasion in Trichinella and Trichuris, since in the intracellular protozoan parasite Toxoplasma gondii, a PAN/Apple domain protein binds host galactose-containing ligands and may enable T. gondii to attach to vertebrate host cells, facilitating host cell invasion33. Alternatively the family may modulate galectin-mediated activation of the host immune system, as has been proposed in T. gondii265.
Gene family expansions related to the helminth surface coat and protective glycocalyx. The surface coats of helminths (tegument or cuticle) offer protection from harsh environmental conditions and microbial pathogens, as well as from the host immune system. For example, larvae of Echinococcus tapeworms are protected by a thick layer comprised of mucins bearing galactose-rich carbohydrates266, while the surface of Taenia oncospheres (larvae) is covered with carbohydrate chains rich in D-mannose, D-, D-galactose and N-acetyl-D-galactosamine267. A total of 9,775 glycosyltransferases which belong to a wide range of GT classes (42 classes) were identified in our analysis (Supplementary Table 27; Supplementary Fig. 23), suggesting these species have an ability to synthesise a wide range of glycans. We observed a large expansion of a galactosyltransferase (GT31) family in cestodes, previously reported in Spirometra215 and cyclophyllidean tapeworms219 (family 100585; 36% predicted secreted; 20-30 copies per species; Extended Data Fig. 4a) and another GT31 family expansion in nematode clade Vc (family 540548; 7-20 copies in Ancylostoma species; Extended Data Fig. 4b and Supplementary Table 9a). In nematode clade IVa, and part of clade IVb (Bursaphelenchus, Panagrellus), a GT49 glycosyltransferase family is expanded (family 140262; 36% predicted secreted; 11-18 copies per species; Extended Data Fig. 4c and Supplementary Table 9a). The outer layer of the cuticle surface of many species of nematodes is covered by a carbohydrate-rich glycocalyx or surface coat29. Microbial pathogens of nematodes bind to glycosylated proteins of the glycocalyx, such as mucins268, while for parasitic nematodes the glycocalyx and ES represent the main immunogenic challenge to the host29. Expansion of the glycosyltransferase families may allow continual modification of the glycocalyx; it has been suggested that this may contribute to increased resistance of nematodes to microbial pathogens268. On the other hand, antigenic variation of surface proteins on successive parasitic stages during infection may help prevent immune recognition by hosts29. Alternatively, these enzymes may enable them to produce host-like glycans that interact with host proteins such as lectins and thereby modulate the host immune response30. Another possibility is that some of these families are involved in glycosylation of secreted adhesion molecules, such as those of Strongyloides species269 that enable parasites to adhere to host epithelia. Helminths are protected by a tough surface coat, the flatworm tegument or nematode cuticle, while nematode eggs are partly composed of the tough and resilient polysaccharide chitin. Glycosyl hydrolases related to chitin metabolism (classes GH18 and GH19), as well as chitin-binding enzymes (CBM14) are enriched in nematode species and only a few are found in trematodes and cestodes, probably reflecting the fact that chitin is one of the structural polysaccharides of nematode bodies and eggs (Supplementary Table 27; Supplementary Fig. 23). 
For example, a chitin-binding (CBM14) Compara family has expanded in Ascaris and Toxocara (family 785119; 62% predicted secreted; 8-10 copies in Ascaris species, 13 in Toxocara; Extended Data Fig. 4d and Supplementary Table 9a), perhaps related to the particularly thick chitinous layer of Ascaris egg shells270, which may underlie their ability to persist in the soil for up to ten years271. Analysis of synapomorphic gene families (Supplementary Information: Results 2.2) recovered one chitin-binding family restricted to clade III and four additional families specific to clade IIIa (Oxyuridomorpha). Trichocephalida (which includes Trichuris and Trichinella) also exhibited a synapomorphic chitin-binding family, in addition to a GH18 family.

Expansions of helminth immunity- and development-related families

Expansions of immune system gene families. We found several expansions of immunity- and development-related gene families in parasitic helminths that are probably unrelated to their parasitic life-style. Like free-living species, parasites encounter other pathogens such as bacteria, fungi and viruses both within the host and during free-living stages, and genes involved in innate immunity are likely to be under strong selection. As mentioned above, it has been suggested that glycosyltransferases may contribute to resistance of nematodes to microbial pathogens, by allowing them to continually modify their surface coat268. In clade IVa nematodes, we see expansion of a GT31 galactosyltransferase family that includes C. elegans bus-4, which was implicated in protection against bacterial infections34 (family 47681; EC 2.4.1.122, 18-37 copies per species; Supplementary Fig. 3a and Supplementary Table 9a), although it is of course possible that genes in this family have diverse functions. We also see expansion in clade IV of a family containing the C. elegans infection response gene irg-3, which is induced in C. elegans upon

infection with pathogenic Pseudomonas aeruginosa (see Table 2 in35; family 781277; 75% predicted secreted; 3-8 per Strongyloides species; Supplementary Fig. 3b). In nematode clades Va and Vc, we observed an expansion of a lysozyme (GH22) gene family (family 95037; 68% predicted secreted; 19-31 copies in Ancylostoma species, 7-13 in Haemonchus species; Supplementary Fig. 3c and Supplementary Table 9a). Lysozyme damages bacterial cell walls by degrading the glycosidic bond linking N-acetylglucosamine and N-acetylmuramic acid in the peptidoglycan of cell walls36, and the C. elegans genes in this family (lys-1, lys-2, lys-3, lys-7, lys-8, lys-9) have been implicated in resistance to infection by certain bacteria272-275. The excretory/secretory products of and Heligmosomoides bakeri include lysozyme-like proteins similar to lys-1, lys-2, lys-3, lys-7 and lys-8276, and Hewitson et al36 suggested that these lysozymes modify the bacterial populations cohabiting with the parasites in the host intestine. In clades Va and Vc we also see a small expansion of a haem peroxidase family containing C. elegans bli-3 and duox-2, both of which are involved in collagen/cuticle biosynthesis, in which they function in dityrosine cross-linking of collagen, and so are important for cuticular integrity; but bli-3 has also been implicated in generation of reactive oxygen species as part of an innate immunity mechanism that protects against fungal and bacterial infection37,277 (family 184998; EC 1.11.1.7/1.6.3.1; 3–7 per Haemonchus species, 10 T. circumcincta, 7 A. caninum, 6 S. vulgaris, 6 O. dentatum; Supplementary Fig. 3d and Supplementary Table 9a).

Expansions of developmental gene families. One of the fascinating features of helminth development is chromatin diminution in nematodes: in early embryonic cells, some of the chromatin is deleted during each cell division, except in cells set aside to become future gonads.
In clade IIIb, chromatin diminution has been observed in Ascaris suum and Ascaris lumbricoides226, P. equorum230, and T. canis278. In clade IIIb we see expansion of a gene family that contains orthologs of the Parascaris univalens coiled-coil protein PUMA (family 114493; 34% predicted secreted; 16–40 per Ascaris species; 18 P. equorum, 25 T. canis, 34 A. simplex; Supplementary Fig. 3e and Supplementary Table 9a). During mitosis in Parascaris, PUMA is localised to the centrosomes, kinetochore-microtubules, and continuous centromeric region of the holocentric chromosomes38, but its molecular function is unknown. The genes in this family have the Rootletin Pfam domain (PF15035), so-called after the human Rootletin protein, which is required for centrosome cohesion before mitosis279. The one C. elegans gene in the family, lfi-1, is expressed in an area surrounding the kinetochore-microtubules during metaphase280. We conjecture that in C. elegans and most nematodes this gene family plays a role in kinetochore and/or kinetochore-microtubule function, and that expansion of this family in clade IIIb is somehow linked to chromatin diminution, since segments of chromatin that are to be eliminated lack any kinetochore230. Curiously, it is also expanded in Anisakis, which is not known to undergo diminution.

Helminths undergo developmental changes such as moulting throughout their life cycle, including some specific to certain clades. For example, L1s of clade IVa species can either develop in a ‘direct’ route via an infective L3, or an ‘indirect’ route via a free-living L344, while Bursaphelenchus xylophilus is the only parasitic nematode known to form dauers281,282. In clade IVa, and in Bursaphelenchus, we see expansion of a family with the ‘ecdysteroid kinase’ Pfam domain (family 147797; 7-20 per Strongyloides species; 34 Bursaphelenchus; Supplementary Fig. 3f and Supplementary Table 9a). This family is absent from C. elegans. In insects, ecdysteroids play a

central role in controlling the onset of moulting, metamorphosis and reproduction283, and after their synthesis, ecdysteroids are phosphorylated by ecdysteroid kinases to an inactive form284. Ecdysteroids may play a role in regulating developmental events such as moulting in some (but probably not all) nematodes285, and some (but not all) have orthologs of ecdysteroid receptors286-289. This ecdysteroid kinase Compara family may possibly regulate the activity of other steroid hormones in clade IVa and Bursaphelenchus, such as the dafachronic acids that regulate dauer entry in Bursaphelenchus290 and the direct/indirect developmental switch in Strongyloides stercoralis40. Expansion of this family may have been involved in the evolution of these developmental choices.

Inexplicable expansions

There are several expansions for which we cannot propose a plausible explanation based on known biology. For instance, in trematodes, we identified an expansion of a Xeroderma pigmentosum group G (XPG) family, particularly in schistosomatids (family 437828; 9 S. mansoni, 9 S. haematobium, 4 Clonorchis; 3 F. hepatica; Supplementary Fig. 24 and Supplementary Table 9a). Human XPG (also called XPGC or Rad2 or ERCC5) is a key enzyme involved in nucleotide excision repair of DNA291-293, so it is tempting to speculate that expansion of this family might enable schistosomatids to repair DNA damage during the rapid asexual replication of sporocyst stage parasites within a snail host. Another possibility is that this family may be involved in repair of DNA damage incurred during the free-living stages of their life cycle (including cercariae, which are exposed to UV light prior to finding a host)294-296.

2.4 Hypothetical genes and families

Authors: Avril Coghlan and Adam Reid

The mean percentage of ‘hypotheticals’ (genes to which we could not assign a protein name; see Functional annotation in Supplementary Methods) per species was 37.0% across the 81 helminths (5-95% range 24.3–56.2%). We expect that hypothetical proteins are more likely to represent false positive gene predictions, which would not be conserved across species. When we classified Compara families as ‘hypothetical’ if ≥90% of the genes in the family had the product description ‘hypothetical protein’, 51,356 (46.9%) families were found to be ‘hypothetical’. That is, although only about one third of genes are annotated as hypothetical, nearly half of families contain mostly hypothetical proteins. This is at least partly explained by the finding that hypothetical families tend to be smaller (P < 2.2e-16; one-sided Kolmogorov-Smirnov test; Supplementary Fig. 6a). There are two possible reasons for this: that some hypothetical families represent small collections of spurious gene models and that small (probably clade-specific) families are less well studied. While most are small, 5,260 hypothetical families have at least 10 members and 2,875 are found in at least 10 species (Supplementary Fig. 6b). Among the families that we identified as expanded in parasitic helminths (see Gene family expansions above), 9/27 strikingly expanded families and 23/141 moderately expanded families

were uncharacterised, and composed of hypothetical proteins (Supplementary Table 9a; Supplementary Fig. 7). Most groups of parasites possessed families in this list, suggesting a wide range of new biology is waiting to be uncovered. However, some of these may be difficult-to-identify transposon-related sequence families, something we have seen for many highly variable families. Most of these families lack signal peptides and transmembrane domains, with the exception of family 393312, which contains 96 members (although none from C. elegans), 25 of which contain signal peptides and no transmembrane domains, suggesting they may be secreted. The family is strikingly expanded in V-Strongylid-other (clade Va), particularly Teladorsagia (26), Haemonchus (22) and Oesophagostomum (13) (Supplementary Fig. 7g). In this family, members with signal peptides tended to occur in T. circumcincta, although some were also in Oesophagostomum, but none were in Haemonchus. No members of this family were predicted to contain transmembrane domains, suggesting this family could be secreted outside of cells. In addition to expansions of uncharacterised families, a number of large expansions in poorly characterised gene families were observed (9 striking and 105 moderate expansions of poorly characterised families; Supplementary Table 9a). For example, in trematodes there is a striking expansion of a sulfotransferase gene family containing S. mansoni gene Smp_089320, which activates the pro-drug oxamniquine that is used to treat schistosomiasis (family 219111; 9-16 copies in trematodes versus 1-6 in cestodes; Supplementary Fig. 7j). Sulfonation of oxamniquine by Smp_089320 produces a reactive product capable of alkylating DNA, proteins and other macromolecules42; however, the native function of Smp_089320 and its orthologs in other trematodes (which is presumably unrelated to oxamniquine) is not known.
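The ≥90% rule used in this section to label a Compara family as ‘hypothetical’ can be illustrated with a minimal Python sketch. This is not the pipeline used for the analysis; the function name, the two input dictionaries (gene to family, gene to product description) and the use of exact fractions are assumptions made only for illustration.

```python
from collections import defaultdict
from fractions import Fraction

HYPOTHETICAL = "hypothetical protein"


def hypothetical_families(gene_to_family, gene_to_product, threshold=Fraction(9, 10)):
    """Return the IDs of families in which at least `threshold` of the
    member genes carry the product description 'hypothetical protein'.

    gene_to_family  : dict mapping gene ID -> family ID (hypothetical input)
    gene_to_product : dict mapping gene ID -> product description (hypothetical input)
    """
    totals = defaultdict(int)        # genes per family
    hypothetical = defaultdict(int)  # hypothetical genes per family
    for gene, family in gene_to_family.items():
        totals[family] += 1
        if gene_to_product[gene] == HYPOTHETICAL:
            hypothetical[family] += 1
    # Exact fractions avoid floating-point surprises at the 90% boundary.
    return {family for family, n in totals.items()
            if Fraction(hypothetical[family], n) >= threshold}
```

Under this sketch a family with 9 of 10 genes described as ‘hypothetical protein’ is labelled hypothetical, whereas one with 8 of 10 is not.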

2.5 GO Annotation Enrichment

Authors: Avril Coghlan, James Cotton and Bruce A. Rosa

GO terms were transferred to helminth proteins from human, zebrafish, C. elegans and D. melanogaster orthologs. Enrichment of GO terms or InterPro and Pfam domains in species groups of interest (e.g. clade III, Ascaridomorpha, etc.; see ‘Analysis group’ and ‘Broader analysis group’ in Supplementary Table 3) was tested, compared to a background of all species. This enabled us to identify some domains that are clade-specific or nearly so. For example, we identified only one InterPro domain (IPR000791, acetate transporter GPR1/FUN34/SatP) which is present in all seven clade I nematode species, and none of the other species (P=0.005; Supplementary Table 26). This domain is found in 3 Romanomermis proteins, 4 from Soboliphyme, and 3 per Trichinella species and 3-4 per Trichuris species. Most of these proteins belong to one Compara family (family 972857), which only has clade I genes. This family was also identified in our analysis of synapomorphies (clade-specific families; Supplementary Information: Results 2.2). Proteins containing the IPR000791 domain have roles in acetate/succinate permeability/transport in fungi and E. coli297,298. In Chlamydomonas, proteins containing this domain are upregulated in response to a boost of acetate in the medium299. Therefore, these proteins may be used by clade I species for acetate or succinate uptake, or efflux. For example, malate dismutation (which produces succinate and/or acetate) has been observed in clade I species (e.g. Trichuris vulpis300), so these transporters may be used to remove acetate/succinate that has built up from malate dismutation. On the other hand, they may have a different ligand; some members of the

GPR1/FUN34/SatP family have been suggested to be ammonium/H+ antiporters in yeast301. The IPR000791 domain is not usually found in animals, so it seems likely that the clade I IPR000791 genes were gained by horizontal transfer from Fungi or Bacteria. Indeed, taking one of the T. muris genes in this family (TMUE_s0096004700), we found that its top non-clade-I BLASTP hit in the NCBI protein database was to a GPR1/FUN34/SatP family protein in the bacterium Azoarcus sp. KH32C (accession WP_015437673.1; 46% identity, E-value 4e-50), a Betaproteobacterium. The second top non-clade-I BLASTP hit was also to a Betaproteobacterium (Rhodoferax, 45% id.), but others of the top five non-clade-I BLAST hits were to Gammaproteobacteria (Amphritea, 45% id.) and Alphaproteobacteria (Pararhodospirillum, 45% id.) (Supplementary Fig. 14b). Interestingly, the Chlamydomonas IPR000791-domain genes seem to have been independently gained from Bacteria299. Phylogenetic analysis of the nematode proteins (Supplementary Fig. 14b) confirmed that the clade I proteins form a monophyletic group distinct from both the Chlamydomonas hits and the proteobacterial hits. This analysis thus failed to confirm the origin of these nematode proteins from any specific bacterial group, possibly because the nematode proteins are too divergent from any prokaryotic sequences.
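As an illustration of how the ‘top non-clade-I BLASTP hits’ mentioned above might be extracted, the following Python sketch filters BLAST tabular output (the standard 12-column `-outfmt 6` layout) and ranks the remaining hits by bit score. It is a sketch under stated assumptions, not the procedure used in the analysis: the function name, the `subject_group` lookup table (subject ID to taxonomic group) and the ranking by bit score are all hypothetical choices.

```python
def top_hits_outside_group(blast_lines, subject_group, excluded_group, n=5):
    """Return up to n hits (subject, % identity, E-value, bit score) whose
    subject sequence is not in `excluded_group`, best bit score first.

    blast_lines    : iterable of tab-separated lines in BLAST -outfmt 6 order
                     (qseqid sseqid pident length mismatch gapopen
                      qstart qend sstart send evalue bitscore)
    subject_group  : dict mapping subject ID -> group label (hypothetical input)
    excluded_group : group label to skip, e.g. "clade I"
    """
    hits = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]
        if subject_group.get(subject) == excluded_group:
            continue  # ignore hits to the excluded group
        hits.append((subject, float(fields[2]), float(fields[10]), float(fields[11])))
    hits.sort(key=lambda hit: hit[3], reverse=True)
    return hits[:n]
```

Subjects missing from the lookup table are kept rather than dropped, so an incomplete taxonomy table would show up as unexpected hits instead of silently hiding them.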

The GTP cyclohydrolase I feedback regulatory protein (IPR009112) was among the InterPro domains most significantly expanded in clade V species (P=1e-6), and was found in all clade V species but otherwise only in 7 outgroup species and one clade I nematode (R. culicivorax) (Supplementary Table 26). GTP cyclohydrolase I feedback regulatory protein (GFRP) regulates GTP cyclohydrolase I and thereby synthesis of the cofactor tetrahydrobiopterin, which is required for hydroxylation of aromatic amino acids302 and for biosynthesis of serotonin, and has been shown to be required for cuticle integrity and biogenic amine synthesis in C. elegans303. Genes with this domain include C. elegans gfrp-1, which belongs to a Compara family that only has clade V and outgroup species (family 1049829) and contains human GTP cyclohydrolase I feedback regulator (GCHFR). Drosophila appears to have lost the GFRP gene, instead using a different mechanism to regulate GTP cyclohydrolase I304. It appears that most nematodes (except clade V) and platyhelminths have also lost the GFRP gene, and so may likewise have a different mechanism for regulating tetrahydrobiopterin synthesis.

2.6 Species Tree

Author: James Cotton

A whole-genome maximum likelihood phylogenetic tree for helminths and ten members of other animal phyla was constructed based on 202 gene families from our Compara database that were present in at least 25% of species, and were single-copy in every species in which they were found (Fig. 1, Extended Data Fig. 2). Within nematodes, our tree supports the grouping of clades III and V, with clade IV as a sister group to these two, although the bootstrap support is low (49%). This contrasts with trees based on ribosomal RNA genes, and a previous tree based on 181 gene families, which suggested that clades IV and V are sister groups, with clade III as their outgroup305. 1,665 Compara gene families are shared by clade III and V nematodes but missing from clade IV (see Compara Database of Gene Families below) and a tree based on gene family presence/absence supports the III-V grouping, albeit only weakly (27% bootstrap support, Extended Data Fig. 2).
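The selection criterion described above (families present in at least 25% of species and single-copy wherever present) can be sketched in Python as follows. This is an illustrative sketch only; the function name and the input layout (a dictionary of per-species copy numbers for each family) are assumptions, not the format of the Compara database.

```python
def select_tree_families(family_counts, n_species, min_fraction=0.25):
    """Return sorted IDs of families suitable for a species tree.

    family_counts : dict mapping family ID -> {species: copy number}
                    (hypothetical input layout)
    n_species     : total number of species in the data set
    min_fraction  : minimum fraction of species in which the family occurs
    """
    selected = []
    for family, counts in family_counts.items():
        # Keep only species in which the family is actually present.
        present = [copies for copies in counts.values() if copies > 0]
        if not present:
            continue
        single_copy = all(copies == 1 for copies in present)
        widespread = len(present) >= min_fraction * n_species
        if single_copy and widespread:
            selected.append(family)
    return sorted(selected)
```

A family that is duplicated in even one species is rejected under this rule, which trades the number of usable families for unambiguous orthology.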

In a tree based on ribosomal RNA, the Spiruromorpha+other (clade IIIc) grouped with all but two Ascaridomorpha (clade IIIb), with Oxyuridomorpha (clade IIIa) as an outgroup, albeit without bootstrap support306. A second ribosomal RNA tree also agreed with this241. Our data set includes the first two genomes for oxyuridomorphs, making it possible to investigate this question. Consistent with the rRNA trees, our tree supports a grouping of Spiruromorpha+other and Ascaridomorpha, with Oxyuridomorpha as an outgroup (bootstrap 92%; Fig. 1). In addition, a tree based on gene family presence/absence groups Spiruromorpha+other with Ascaridomorpha, with Oxyuridomorpha as an outgroup (bootstrap 90%, Extended Data Fig. 2). Oxyuridomorphs and Ascaridomorphs do not use intermediate hosts, while many Spiruromorphs use hosts to transit from one definitive host to another305, so Ascaridomorpha and Oxyuridomorpha may have independently lost use of an intermediate host. Another interesting question is the placement of the genus Bursaphelenchus, which is grouped with Panagrellus in our tree (bootstrap 76%), as a sister group to Meloidogyne and Globodera (Fig. 1), and this is also supported by our tree based on gene family presence/absence (Extended Data Fig. 2, bootstrap 100%). Our tree is also consistent with a published rRNA tree, which grouped Bursaphelenchus with Panagrellus, with Strongyloides as an outgroup, although without bootstrap support, and the authors pointed out that GC content variation may bias the rRNA tree306. In contrast to our result, Bursaphelenchus, Meloidogyne and Globodera have previously been classified in the Tylenchomorpha, while the superfamily Panagrolaimoidea, to which Panagrellus belongs, was in the infraorder Panagrolaimomorpha, along with the Strongyloides species and other clade IVa species307 (Supplementary Table 3).
Our bootstrap support for the placement of Bursaphelenchus with Panagrellus is relatively low (76%), raising the question of whether Bursaphelenchus was erroneously placed with Panagrellus in our tree, rather than within the Tylenchomorpha, as one would expect. The lungworm Dictyocaulus is grouped with the lungworm genus Angiostrongylus in our tree (with 100% bootstrap support; Fig. 1), and also in our tree based on gene family presence/absence (69%; Extended Data Fig. 2), in agreement with a previously published rRNA tree (bootstrap 57%306). This is surprising because Angiostrongylus is in the family Metastrongylidae, and Dictyocaulus is in Trichostrongylidae (Supplementary Table 3) but is not placed in our tree with other Trichostrongylidae such as Haemonchus and Nippostrongylus (clade Va; strongylid, other). This suggests that the family Trichostrongylidae is not monophyletic. Previous analyses of most single-gene datasets, and of complete mitochondrial genome data, supported either monophyly of cestodes and trematodes with monogeneans forming an outgroup to these two groups, or supported a clade of monogeneans and cestodes. The former arrangement suggests that complex, endoparasitic life-cycles evolved only once in Neodermatan platyhelminths (monogeneans, trematodes and cestodes)308, while the latter is traditional and supported by some morphological similarity between the hooked haptors of monogeneans and the onchospheral hooks of tapeworm larvae309. In our species tree, the monogenean Protopolystoma is grouped with the trematodes (bootstrap 90%; Fig. 1), but this relationship is not recovered in the tree based on gene family presence/absence (Extended Data Fig. 2); our gene content analysis shows Protopolystoma as sister group to both cestodes and trematodes. The only previous analysis of a genome-wide dataset for a monogenean parasite (Gyrodactylus salaris) also recovers this

relationship and conflicts with our sequence-based result310, although this is of a very distantly related monogenean, and it is not clear that Monogenea is a monophyletic group311,312.

3. Proteins historically targeted for drug development

3.1 SCP/TAPS protein family

Authors: Huei-mien Ke, Tzu-hao Kuo, Tracy J. Lee and Isheng Jason Tsai

The SCP/TAPS (Sperm-Coating Protein/Tpx/antigen 5/Pathogenesis) protein family (defined by the presence of Pfam CAP domain PF00188, or InterPro IPR014044) is ubiquitously found in both prokaryotes and eukaryotes41 and, having been characterised separately in various species, has many alternative names (cysteine-rich secretory proteins (CRISPs), antigen 5 (Ag5), plant pathogenesis-related 1 proteins (PR-1), Sc7, Golgi-associated pathogenesis related (GAPR1), S. mansoni SmVAL). In addition to their wide range of fundamental roles in animal and plant basic biology41,313, SCP/TAPS proteins are often secreted314-316, possess immunomodulatory properties317-319 and have potential to be vaccine candidates320-323. There were a total of 3,915 putative SCP/TAPS genes present in the dataset of 81 helminths (Supplementary Table 10). After filtering by sequence length (Supplementary Information: Methods 23), 3,167 SCP/TAPS sequences from the 81 helminths (3,121 helminth sequences) and two outgroups (15 from human, 31 from D. melanogaster) were clustered by USEARCH140 using a threshold of ≥70% protein sequence similarity. Only 1,456 (46.0%) SCP/TAPS sequences could be placed into 498 clusters, suggesting high divergence of these genes. A maximum likelihood phylogeny was computed based on an alignment of the other 1,711 singleton sequences and the consensus sequences of the 498 clusters. The phylogeny suggested that SCP/TAPS comprise two major subfamilies as previously proposed313 (‘Group 1’ and ‘Group 2’ in Fig. 3a, Supplementary Fig. 5). SCP/TAPS expansions in N. americanus (120), Strongyloides and Parastrongyloides (40-179 copies in five species), Ancylostoma ceylanicum (269), and S. mansoni (33) have previously been reported44,316,324,325.
Here, cross-phyla comparisons again revealed several independent expansions, particularly in parasitic clade V (18-381 copies in 17 species) and parasitic clade IVa intestinal nematodes (39-166 copies in five species; Fig. 3b), with the highest, 381 copies, identified in Ancylostoma caninum. The tree suggested that SCP/TAPS expanded into two smaller subgroups within group 1, and into two smaller subgroups within group 2, since the common ancestor of IVa or strongylids (Fig. 3a). There were also exceptions, for example, was the only clade III nematode to have undergone expansion, with 66 copies. Of the 3,121 length-filtered SCP/TAPS genes in the 81 helminths, 79.4% were annotated with one single CAP domain, and no other domains. 10.4% (326) of genes were annotated with two SCP domains (Supplementary Fig. 25). The other 10.2% had other combinations (e.g. a single CAP domain plus some other domain(s)). The two-SCP/TAPS-domain genes were most commonly found in species with expansions. The two-SCP/TAPS-domain genes did not share a common

ancestor within or between different species and were found proximal to single-SCP/TAPS-domain genes, suggesting a fusion after speciation.
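The domain-architecture breakdown reported in this section (a single CAP domain only, two CAP domains, or some other combination) could be tallied along the following lines. This Python sketch is illustrative rather than the method used: the function names, the input layout (gene ID mapped to a list of Pfam accessions) and the assumption that the ‘two-domain’ class contains exactly two CAP domains and nothing else are all hypothetical.

```python
from collections import Counter

CAP = "PF00188"  # Pfam CAP domain accession, as given in the text


def classify_architecture(domains):
    """Classify one gene by its list of Pfam domain accessions."""
    n_cap = domains.count(CAP)
    only_cap = all(domain == CAP for domain in domains)
    if n_cap == 1 and only_cap:
        return "single CAP"
    if n_cap == 2 and only_cap:  # assumption: two CAP domains and no others
        return "two CAP"
    return "other"


def architecture_percentages(gene_domains):
    """Return the percentage of genes in each architecture class.

    gene_domains : dict mapping gene ID -> list of Pfam accessions
                   (hypothetical input layout)
    """
    counts = Counter(classify_architecture(d) for d in gene_domains.values())
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

Percentages are rounded to one decimal place, matching the precision at which the proportions are quoted in the text.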

3.2 Proteases and Protease Inhibitors

Authors: Avril Coghlan, Nancy Holroyd and Neil Rawlings

Proteases and protease inhibitors (PI) have been linked to a number of different functions, including immunomodulation326, host tissue penetration43, modification of the host environment (e.g. anticoagulation)327 and digestion of blood328. High expression of proteases and PI in transcriptomic and proteomic data from parasitic stages of nematodes has been observed (e.g. astacins in Strongyloides44), as has expansion of some protease and PI families (e.g. chymotrypsin A-like serine proteases in Trichuris53). Proteases are secreted by helminths, particularly in parasitic stages44, and are therefore thought to be involved in host-parasite interactions. We investigated helminth proteases in several ways: through the MEROPS database163, and during analysis of expanded gene families (Gene family expansions: Supplementary Information: Results 2.3), novel domain combinations (Novel domain combinations: Supplementary Information: Results 3.7) and clade-specific families (Synapomorphies: Supplementary Information: Results 2.2). Proteases and PIs were annotated using BLAST searches against MEROPS (Supplementary Information: Methods 17, Supplementary Table 11). Twenty-three protease families were found in every platyhelminth and nematode genome (Supplementary Table 11a). Platyhelminths generally contained fewer genes that encode protease homologues than nematodes (medians 281 and 452, respectively; Supplementary Table 11a, Fig. 4b). Looking at the higher quality tier 1 genomes (those with high-quality assemblies; Supplementary Information: Methods 7), on average platyhelminths had 306 protease genes per species (5-95% quantile range 228 to 429), compared to 485 in nematodes (283 to 766). This agreed with our general finding of a lower number of enzyme annotations (EC numbers) per platyhelminth species compared to nematodes (Supplementary Information: Results 4.1).
Clade V species had more proteases on average (629 in tier 1 species) than any other nematode clade.

Astacins

The MEROPS protease family/subfamily with the most paralogues in any one genome, and the only family where there were more than a hundred paralogues predicted to be active, was M12 (astacin metalloendopeptidases). Astacins have previously been found to be expanded in Strongyloides and Parastrongyloides44. The ability of a recombinant astacin-like metallopeptidase from Ancylostoma to digest collagen45, and the presence of astacins in the excretory secretory products of Strongyloides infectious-stage larvae329, suggest that astacins are involved in parasite migration through host tissue. Clade IVa and clade V species had the most paralogues: for example, Parastrongyloides trichosuri (clade IVa, 331 paralogues) and (clade V, 163 paralogues). In our analysis of gene family expansions (Supplementary Information: Results 2.3), we identified six Compara families of astacins that had expanded, four of which had expanded in clade IVa, one in clade Vb (strongylid-lungworm) and one in Vc (hookworms) (Supplementary Table 9a;

Supplementary Fig. 8a-g). The family expanded in Vc (family 163808; 22 O. dentatum, 20 N. americanus, 13 A. duodenale, 12 H. bakeri, 11 S. vulgaris, 9 A. caninum, 8 A. ceylanicum) included A. caninum astacin mtp-2, which is secreted in adult ES products and is hypothesised to function in digestion of the host tissue plug lodged in the buccal capsule of the adult parasite330. Ancylostoma caninum astacin mtp-1, secreted by iL3s and involved in tissue migration45, belonged to a Compara family that has independently expanded in clades V and IVa, and was not detected by our analysis of gene family expansions (family 3: 11 T. circumcincta, 76 S. papillosus, 72 A. caninum, 53 H. bakeri, 41 H. contortus, 35 P. pacificus, 31 S. ratti). Strongyloides stercoralis ‘strongylastacin’, which is secreted by iL3s and hypothesised to be used by larvae of S. stercoralis to penetrate human skin331, belonged to family 132363, one of the four families identified as expanded in clade IVa. In contrast to the nematodes, most platyhelminths had far fewer astacin homologues: Fasciola hepatica had only three (Fig. 4b). On average, platyhelminths had just 7 astacin homologues, compared to an average of 62 across nematodes (tier 1 species). This was consistent with the hypothesis that nematodes use astacins for skin invasion and migration through host tissue but that platyhelminths use different proteases for this purpose, for example the serine proteases or cysteine proteases used by schistosomes332.

Cathepsin proteases

Cathepsin proteases, which are in the C1 (papain) family, have previously been implicated in host-parasite interaction phenomena such as tissue invasion43, feeding333, and immune evasion334. The C1 family was one of 23 MEROPS protease families present in all helminth species. In our analysis of expanded gene families (Supplementary Information: Results 2.3), we identified two expansions of Compara families with the Pfam ‘Papain family cysteine protease’ (PF00112) domain, both expanded in clades Va (strongylid-other) and Vc (hookworms): family 38173 (41 H. contortus, 40 O. dentatum, 33 T. circumcincta, 26 Strongylus vulgaris, 20 A. caninum; Supplementary Fig. 8h) and family 80624 (56 O. dentatum, 36 A. caninum, 25 S. vulgaris, 24 H. contortus, 19 T. circumcincta; Supplementary Fig. 8i). Compara family 38173 included human cathepsin B (CTSB), and had members in most helminths, including several known cathepsin B genes from C. elegans (e.g. cpr-6, cpr-8), H. contortus (e.g. AC-2335 and AC-4336), Fasciola hepatica (e.g. FhcatB1334), and S. mansoni (e.g. Sm31337). The other Compara family (id. 80624) was found primarily in clade V, and did not include a human member, but did have several cathepsin B genes from C. elegans (cpr-1, cpr-2, cpr-3, cpr-4, cpr-5), H. contortus (e.g. gcp-7338 and hmcp-2339) and Ancylostoma caninum (e.g. CP-2340). There were few genes from flatworms in Compara family 80624 (e.g. 4 F. hepatica, 1 S. japonicum, no S. mansoni) but in family 38173 there were several (e.g. 11 F. hepatica, 10 S. japonicum, 9 T. regenti, 9 S. mansoni). This confirmed previous findings of a large number of cathepsin B genes in H. contortus (63 genes46), and marked expansions of cathepsin B genes in schistosomatids and fasciolids47. Note that while our two Compara families both had a mixture of C. elegans and Haemonchus cathepsin B genes, a phylogenetic analysis by Laing et al46 found that Haemonchus genes formed separate clades from those of C. elegans.

A striking association between these two Compara families and the haematophagous nature of some adult helminths was evident. The nematode species that were most abundant in the families are blood-feeding (Haemonchus in clade Va; Ancylostoma, Oesophagostomum, and Strongylus in clade Vc), as were the most abundant platyhelminth species (the flukes Fasciola, Schistosoma and Trichobilharzia; Supplementary Table 12). Many of the proteases in these families are most highly expressed in adult worms (e.g. H. contortus AC-2335, AC-4336, gcp-7338, hmcp-2339 and A. caninum CP-2340), and larval stages within the host (e.g. H. contortus cpr-6 homologues341, O. dentatum cathepsin Bs342) and are hypothesised to play a role in blood feeding328. For example, A. caninum CP-2 is part of a haemoglobin degradation pathway48, and AC-4 is the main cysteine protease in -degrading extracts purified from adult H. contortus336. Likewise, S. mansoni Sm31 is expressed in adult worms337 and involved in degrading blood343; cathepsin Bs are part of a blood degradation cascade in S. mansoni49. In addition, it is likely that some of the cathepsin B proteases expressed in specific developmental stages may be involved in functions unrelated to blood feeding. F. hepatica FhcatB1 is thought to play a role in the digestive tract of newly excysted juvenile parasites344, while other F. hepatica cathepsin B proteases may be involved in metacercariae excystment, invasion of the host51,345, penetration of the host intestinal wall346, and immune evasion by inducing degradation of immunoglobulin heavy chain334. In nematodes cathepsin Bs may also play other roles besides blood feeding; for example, a cathepsin B might be involved in larval development, and possibly emergence of the nematode Parelaphostrongylus from its intermediate snail host50.

Trypsin proteases, prolyl oligopeptidases, and aminopeptidases

In addition to the expansions of astacins and cathepsins described above, our analysis of Compara family expansions (Supplementary Information: Results 2.3) also detected the previously described expansions of prolyl oligopeptidases in clade IVa44, trypsin proteases in clade I53, and aminopeptidases in cestodes215. Trypsin is secreted from the intestinal infective larvae of Trichinella spiralis347. Trichuris muris strongly expresses trypsin domain proteins in larvae, and in the anterior region of adults, and most are probably secreted53. Trap et al348 suggested that Trichinella spiralis trypsin domain genes may be involved in proteolysis of cuticle proteins during larval moulting, and/or digestion of host tissue for nutrition. The trypsin proteases are serine proteases (MEROPS family S1) that showed an expected large expansion in clade I, with the exception of Romanomermis, which, unlike the other clade I species sampled, is a parasite of insects rather than vertebrates (Fig. 4b; Supplementary Table 12). A proteomic analysis found an abundant prolyl oligopeptidase that is specific to the excretory and secretory (ES) products of female S. ratti worms329. Prolyl oligopeptidases (MEROPS family S9) showed a large expansion in clade IVa (Fig. 4b). A previously observed expansion of MEROPS family M17 (amino-terminal leucyl aminopeptidases) in the Spirometra genome215 was also observed in the cestodes Dibothriocephalus and Mesocestoides (Fig. 4b). The expansion could relate to the complex life cycles of these cestodes, which include stages for colonising secondary intermediate hosts.

Protease inhibitors

There were only four MEROPS inhibitor families found in all helminth proteomes (Supplementary Table 11b): I1 (Kazal), I2 (Kunitz-BPTI; see below), I4 () and I93 (sizzled), which inhibit a range of proteases (e.g. serine proteases, cysteine proteases, metalloproteases). There were several MEROPS inhibitor families commonly present in nematodes but rare or missing in platyhelminths (e.g. I17 (see below), I12, I19, I31, I33, I35 and I83). In contrast, there were no MEROPS protease inhibitor families prevalent in platyhelminths that were rare or absent in nematodes (Supplementary Table 11b; Fig. 4b).

Trypsin, trypsin-like, and chymotrypsin/elastase protease inhibitors

Different families of trypsin inhibitors can be used to protect helminths from degradation by host proteases, to facilitate feeding and to manipulate the host response to the parasite52.

I2 inhibitors. The most abundant protease inhibitor across all parasitic nematodes and platyhelminths was I2 (Kunitz-BPTI), a trypsin inhibitor present in all helminth proteomes (Fig. 4b). I2 inhibitors inhibit proteases of the S1 family, which includes trypsin and chymotrypsin. Nematodes had an average of 102 I2 members per species, whilst platyhelminths had 38 on average. Kunitz-type (I2) inhibitors were previously described as expanded in cestodes and especially Spirometra215. In agreement with this, we found an expanded Compara family of Kunitz-type inhibitors in cestodes (family id. 576282; 10 per Echinococcus species; 6-9 per Taenia species; 9 Spirometra; Supplementary Fig. 8n; Supplementary Table 9a; Supplementary Information: Results 2.3). Although MEROPS family I2 was found in all flatworms, it was most abundant in cestodes (Fig. 4b). A putative diverged trypsin inhibitor Compara family (id. 616925), distantly related to Kunitz-BPTI inhibitors, was present and expanded only in clade Va and Vc species, and absent from lungworms (14 H. bakeri, 8 H. contortus, 8 A. caninum, 7 S. vulgaris, 6 T. circumcincta; Supplementary Fig. 8j; Supplementary Table 9a). The family had low sequence similarity to Kunitz/BPTI; 11 genes in the family had weak BLASTP hits in the NCBI protein database to an Ancylostoma protein annotated as ‘Kunitz/Bovine pancreatic trypsin inhibitor domain protein’ (accession EPB75542.1; e.g. BLASTP hit from ANCCAN_24592 had 28% identity, E-value 7e-13). No known MEROPS protease inhibitor family was identified for this Compara family, suggesting that it could be a novel protease inhibitor family specific to clade V parasites (other than lungworms).

I8 inhibitors. 
Many trypsin-like inhibitor (TIL) genes are upregulated in the parasitic females of Strongyloides ratti44. Consistent with this, in clade IVa we identified an expanded Compara family of trypsin inhibitor-like genes (id. 426447; 17 P. trichosuri, 9-14 per Strongyloides species; Supplementary Table 9a; Supplementary Fig. 8l). Genes in this Compara family (e.g. SRAE_1000260200) were classified in the MEROPS ‘chymotrypsin/elastase inhibitor’ family (I8), which inhibits trypsins, chymotrypsins and elastases and was abundant in clades IVa and Vc, and to a lesser extent in clade IIIb (Fig. 4b).

In clade IIIb we also observed an expanded chymotrypsin/elastase (I8) inhibitor Compara family (family 754923; 8 per Ascaris species, 3 Toxocara). Chymotrypsin/elastase inhibitors from Ascaris are thought to protect it from host proteases55 (Supplementary Fig. 8m).

Other protease inhibitors

I17 inhibitors. The previously reported expansion of the protease inhibitor family I17 (WAP-type inhibitors such as antileukoprotease, inhibitors of trypsin-like serine endopeptidases) in clade I53 was even more striking in this larger data set (Fig. 4b). I17 was commonly present in nematodes but rare or missing in platyhelminths.

a2M inhibitors. Among our Compara family expansions (Supplementary Information: Results 2.3), one of the most prominent was a family of homologues of the protease inhibitor α-2-macroglobulin (a2M; MEROPS family I39; Compara family 358015). MEROPS family I39 members interact with a wide variety of endopeptidases such as aspartic and serine proteases. This Compara family was present in all platyhelminth species (but not nematodes), and a significant expansion was observed in cestodes, particularly (15 S. erinaceieuropaei, 15 D. latus, 6-8 per Taenia species, 3 per Echinococcus species, 2-13 per Hymenolepis species; Supplementary Table 9a, Supplementary Fig. 8k). The only C. elegans gene with the Pfam α-2-macroglobulin domain (PF00207), tep-1, was present in a different Compara family (172243), which had many nematode members, with 1-6 copies per species. In agreement with this, the MEROPS family I39 (α-2-macroglobulin, or a2M) was present in most helminths, including nematodes, but most abundant in cestodes (Fig. 4b). In flatworms, some α-2-macroglobulin genes were found to be highly expressed in O. viverrini349 stages within the biliary tract of the host. Vertebrate α-2-macroglobulin is known to inhibit coagulation by inhibiting thrombin350, and in Opisthorchis, reduction of clotting at attachment or feeding sites351 may be aided by this gene. α-2-macroglobulin has also been shown to bind to several important cytokines and hormones54. This expanded family could therefore potentially be important in immunomodulation, by reducing the activity of defensive signalling proteins from the host.

3.3 GPCRs

Authors: Tim Day, Nicolas J. Wheeler and Mostafa Zamanian

Known GPCRs from C. elegans, B. malayi, O. volvulus, S. mansoni and S. mediterranea (providing information from both parasitic and free-living flatworms/roundworms) were identified from literature mining and previous GO annotations and used as seeds for identification of GPCR families in our Compara database (Supplementary Information: Methods 24). Additional GPCR families were identified from our analysis of synapomorphic gene families, and manual curation, resulting in a total of 230 GPCR families, which included 5,939 genes from our 33 ‘tier 1’ (with high-quality assemblies; Supplementary Information: Methods 7) helminth species (Supplementary Table 15a, b, d). The number of GPCRs per species ranged from 58 to 252 (5-95% quantile) across the 33 tier 1 helminths. A heatmap of these data uncovered a number of trends (Extended Data Fig. 5). First, the massive radiation of chemosensory GPCRs in C. elegans was unmatched by any other nematode species.

Indeed, of the C. elegans complement of 1402 GPCRs, 87% were chemosensory GPCRs (Supplementary Table 15b). This did not seem to be due to better annotation of GPCRs in C. elegans compared to other species. Comparison with C. briggsae has shown that the high GPCR count of C. elegans is due to expansion, rather than loss from other lineages217,352,353. After C. elegans, the next largest chemosensory fraction occurred in the parasitic nematodes Strongyloides papillosus (48%) and S. venezuelensis (47%), then in Parastrongyloides trichosuri (43%) and the free-living Rhabditophanes sp. KR3021 (43%; Supplementary Table 15b). All examined parasitic nematodes possessed chemosensory receptors but, after normalising for the total numbers of genes, the greatest numerical representation was apparent in clade IVa parasites (Extended Data Fig. 5). In particular, there were several synapomorphic chemoreceptor families in this clade (Supplementary Table 15a), which likely reflects the extensive range of environments with which these nematodes interact, given their life cycles that involve full development of free-living adults as well as multiple parasitic stages. Some chemoreceptor families were almost completely conserved across all parasites in the phylum. These included homologs of C. elegans daf-37, which is known to mediate ascaroside signalling, consistent with the importance of this small molecule signalling pathway in the likely evolutionary transition from dauer larvae to infective larvae354. Interestingly, no chemosensory GPCRs were readily identifiable in platyhelminth species, even though they are found in arthropods and vertebrates and diverged chemoreceptors are found in other phyla such as Molluscs56. Given that platyhelminths, like nematodes, must navigate diverse environmental cues, it is unlikely that these species do not engage in sophisticated forms of chemosensation. 
We hypothesise that either some subset of identified phylum-specific orphan receptors are involved in such processes, or that chemosensation in flatworm parasites is not primarily driven by GPCRs. A difference was seen in the GPCR complements of filarial nematodes and non-filarial nematodes (Extended Data Fig. 5). Filarial nematodes are nested within clade IIIc (Spiruromorpha+other), but phylogenetically diverged from their non-filarial counterparts (Fig. 1). O. volvulus, L. loa, B. malayi, and L. sigmodontis showed an overall decrease in GPCR families and, on average, possessed about 41% fewer GPCRs than other members of clade III among the tier 1 species (medians 103 versus 145). Within clade I, Trichinella spiralis, Trichuris muris and Soboliphyme baturini were similarly reduced in their GPCR complements (58, 49, 69 genes, respectively) compared to other non-filarial nematodes (Extended Data Fig. 5). In contrast, Romanomermis culicivorax appeared to contain a more standard GPCR complement (174 genes). An earlier but similar cluster analysis of the R. culicivorax genome225 reported differences in gene content between the whole proteomes of clade I species. The authors surmised that T. spiralis is not representative of the rest of the clade, but the present GPCR analysis suggested that in terms of GPCR content, it is actually R. culicivorax that differs more. Although many families are conserved in both phyla, we observed far fewer GPCRs in platyhelminths than in nematodes (medians 75 versus 182; 5-95% quantiles of 60-114 versus 58-263, for tier 1 species; two-sided Wilcoxon test, P=0.005). Platyhelminths also had fewer GPCR Compara families (57 versus 194 containing tier 1 species). However, there were 21 families that

had members of both phyla. There were also a number of platyhelminth-specific groups, such as the PROFs (Platyhelminth-specific Rhodopsin-like Orphan Families). These receptors are Class A receptors and hypothesised to be peptide responsive but they do not show any significant homology to any annotated or deorphanised proteins174. Additionally, there were several other non-PROF GPCR families that were specific to flukes (trematodes), showing a clade-specific GPCR expansion (Extended Data Fig. 5).

3.4 Pentameric Ligand-Gated Ion Channels

Authors: Robin Beech and Mostafa Zamanian

The pentameric ligand-gated ion channels (pLGICs) mediate fast synaptic neuromuscular signalling and are the targets of many anthelmintic drugs355. Five subunits combine to form either homo- or hetero-pentameric channels. The family can be divided broadly into the activating, cationic channels that lead to membrane depolarisation and the inhibitory, anionic channels that lead to hyperpolarisation, and further divided based on their principal neurotransmitter, including acetylcholine, choline, biogenic amines, glutamate, GABA and protons among others. The nematodes are particularly rich in pLGIC subunit genes, with the model nematode, C. elegans, encoding at least 29 cationic channel subunits and more than 35 anionic channel subunits356. All parasitic nematode species had members of previously described nematode acetylcholine receptor classes (deg-3, acr-16, unc-29, acr-8 and unc-38). We found both cationic and anionic receptor subunits had expanded in nematode clades III, IV and V, following divergence from clade I (Supplementary Fig. 9; Supplementary Table 16). This suggested neurotransmitter signalling is particularly important and diverse, with expanded, derived channels regulating behavioural characteristics specific to the nematodes such as chemotaxis, feeding, movement and reproduction. In contrast, platyhelminths contained a more limited number of subunit genes that varied by class: acr-16-like and acr-26-like predominated in the flukes and unc-63-like in the tapeworms. The trematode, S. mansoni, encodes 27 different pLGIC subunits, four of which encode glutamate-gated anion channel subunits357, and at least five are believed to encode acetylcholine-gated anion channels58. A phylogenetic tree of all pLGIC channel subunits (Supplementary Fig. 10) placed these as a distinct clade, most closely related to nematode acr-26 and acr-27, that extended to all sequenced platyhelminths. 
Notably, an earlier, smaller phylogenetic analysis suggested that the trematode acetylcholine-gated anion channels were related to another nicotinic acetylcholine receptor, Clonorchis acr-16 58, but this earlier analysis was based on fewer genes and did not include acr-26/acr-27 representatives. S. mansoni acetylcholine-gated chloride channel genes in the previous analysis58 belonged to Compara families whose members were placed in the acr-26/acr-27 clade of our phylogenetic tree (families 921476, 123484). One of the previously studied genes, Smp_142700, was not included in our final tree due to poor alignment of the small Compara family to which it belongs, but preliminary phylogenetic analyses showed that it fell within the acr-26/acr-27 clade. The apparent switch of some acetylcholine receptor subunits from cationic to anionic is not unprecedented. For instance, the EXP-1 GABA receptor in C. elegans (that controls defaecation) switched from anionic to cationic, most likely in the ecdysozoan ancestor of

the nematodes358,359. A similar end result is achieved in the nematodes with a switch in ligand specificity of an anionic channel to bind acetylcholine360. A reversal in acetylcholine receptors from cationic to anionic in the platyhelminths would have led to dysregulation of a tightly coordinated system in any intermediate stage. It seems likely, therefore, that the role of acetylcholine in an ancestor was much reduced, with other signalling molecules, perhaps the biogenic amines361, taking on a more dominant role. The newly derived anionic acetylcholine receptor could then later acquire a more significant physiological role in the neuromusculature. Understanding how this could occur may be extremely illuminating for our understanding of how neuromuscular signalling has evolved.

3.5 ABC Transporters

Author: Robin Beech

The ATP-binding cassette (ABC) transporters are ATP-dependent, multi-pass membrane proteins that actively transport a diverse range of substrates. Although they share structural similarities, their organisation falls into several distinct family categories362. There were more ABC transporter genes in the nematode C. elegans (60)363 than in humans (49)364, whereas S. mansoni had 40 (Supplementary Fig. 11; Supplementary Table 17). For example, there were 14 functional P-glycoproteins in C. elegans (Supplementary Table 17) compared to four in humans60. The families appeared dynamic, with gene gain and loss events changing the complement of orthologues found in different species, while overall the numbers remained relatively constant. Known as drug transporters from their link with resistance to chemotherapy in cancer treatment, the biological roles of P-glycoproteins specifically in helminths are not well understood. In C. elegans many are expressed in the pharynx, intestine and amphids and are believed to be involved in lipid transport and protection from environmental toxins, heavy metals and infection60,363,365,366. Parasitic platyhelminths and nematodes spend a significant part of their lifecycle within a host. The P-glycoproteins (Pgps) may play a direct role in protection from the host immune system, which may explain their abundance in helminth genomes367. There is particular interest in whether increased Pgp expression leads to anthelmintic resistance. The evidence for this comes from the fact that inhibitors of Pgp activity reverse anthelmintic resistance368 and that expression of nematode and trematode Pgp in mammalian tissue culture cells leads to anthelmintic resistance369. This is a complex issue since different Pgps may act in concert and it is difficult to ascribe resistance to individual Pgp genes. 
In addition, the family appeared dynamic with loss of some Pgp copies and expansion of others46,92 (Supplementary Fig. 11), which means information gained on function in a model such as C. elegans may not be translated directly to an understanding in related parasitic species. The membrane lipid phosphatidylserine is normally limited to the internal leaflet of surface membranes. The ABC transporter ced-7 in C. elegans is involved in transport of phosphatidylserine to the outer membrane leaflet, where it signals cells for apoptotic body clearance370. Orthologs of this gene were found in the nematodes (Supplementary Fig. 11). Although distant homologues of ced-7 were found in a few platyhelminth species (Supplementary Fig. 11), they belonged to different Compara families than the nematode ced-7 genes, so were likely not ced-7 orthologues. The

schistosome membrane contains phosphatidylserine371 so signalling the membrane for apoptotic is presumably achieved through a different mechanism, or the flipping of phosphatidylserine may be achieved by other means.

3.6 The Kinomes of Nematodes and Platyhelminths

Authors: John Martin and Bruce A. Rosa

Nematode and platyhelminth species had an average of 251.1 total kinases per species (across all 81 species; see Supplementary Information: Methods 10; Supplementary Table 25). Tyrosine kinases (TK) (and tyrosine kinase-like, TKL, kinases), when combined, were the most abundant family (average 75.0 kinase proteins per species; Supplementary Fig. 26), as previously reported for C. elegans372, but these were variable between nematode clades (average 53.7 among clade I species, and 91.2 among clade V species). The largest difference in abundance between nematodes and platyhelminth species was the Casein Kinase (CK1) family, as previously found in a comparison of C. elegans and S. mansoni213, with an average of 46.9 genes among nematode species and only 12.7 among platyhelminth species. In platyhelminths, a few notable cestode-specific protein kinase Compara family expansions were observed, including a tyrosine kinase-like (TKL) family (id. 94187, which contains C. elegans src-1, involved in early embryo development373) and an A,G&C (AGC) family (id. 110854, which contains C. elegans rsks-1, implicated in longevity374; Supplementary Fig. 26). The second family (110854) was also detected in our analysis of gene family expansions (Supplementary Table 9a; Supplementary Information: Results 2.3).

Two CK1 kinase families (8185, which includes C. elegans ttbk-2 and spe-6, and 38771), and a TKL family (22454, which includes C. elegans spe-8 and frk-1), were highly abundant and universally conserved across nematodes (with the exception of family 38771 in clade IIIa), and were universally absent from platyhelminth species. These two CK1 families are both tau kinase-like families; in vertebrates, tau tubulin kinases control the phosphorylation state of the microtubule-associated protein Tau, thereby regulating its binding to neuronal microtubule networks375,376. However, spe-6 and spe-8 are involved in sperm activation in C. elegans377,378, and frk-1 in cell division during development379, so these CK1 and TKL families may have different roles in nematodes.

One ‘right open reading frame’ (RIO) kinase family (family 33836) was highly abundant among Strongyloides species (13-44 per Strongyloides species; 13 in C. elegans; Supplementary Fig. 26).

3.7 Novel domain combinations, including those related to proteases

Authors: Avril Coghlan, Jane Lomax, Adam Reid and Myriam Shafie

In line with our findings more generally across these species (see Proteases and Protease Inhibitors, Supplementary Information: Results 3.2) we identified novel phylum-specific domain

combinations involving several protease domains, in both phyla (Supplementary Table 14). Nematode-specific protease-related domain combinations included the ShK domain-like and Carboxypeptidase activation peptide domain, EB module domain and Kunitz/Bovine pancreatic trypsin inhibitor domain, and the Kunitz/Bovine pancreatic trypsin inhibitor domain and EGF-like domain. These were all found in both parasitic and free-living nematodes, so are probably not related to parasitism. For example, the ShK-like and Carboxypeptidase activation peptide domain combination was found in several C. elegans genes, including suro-1, which regulates cuticle formation and body morphogenesis380. We found four novel protease-related domain combinations specific to platyhelminths: the Ku70/Ku80 N-terminal alpha and beta domain with Prolyl oligopeptidase domain; Ankyrin repeats with the domain; I zinc metalloprotease (M18) domain with ‘helix-turn-helix, Psq’ domain; and Peptidase family M28 domain with Enhancer of rudimentary domain (Supplementary Table 14). These were not found in the free-living species Schmidtea mediterranea, so may be related to parasitism. For example, the combination of M18 metalloprotease and ‘helix-turn-helix, Psq’ (a domain involved in DNA binding381) was found in 15 genes belonging to the same Compara family (id. 768925). This Compara family included S. mansoni Smp_210600 and Smp_210610, adjacent gene predictions that each have one of these two domains, and were recently reannotated as a single gene Smp_246800. This gene was shown to have significantly higher expression in 3-hour post-infection schistosomula compared to cercariae382, suggesting a potential role in regulating gene expression shortly after invasion.

4. Metabolic reconstructions of nematode and platyhelminth parasites

Authors: Avril Coghlan, James Cotton, John Parkinson, Swapna Lakshmipuram Seshadri and Rahul Tyagi

4.1 Overall metabolic potential

Number of EC numbers and enzymes

Overall, 789 unique EC numbers were confidently ascribed to enzymes across the 81 genomes but there was a marked difference in the number that could be identified in nematodes versus platyhelminths (means 306 and 395 without hole filling, respectively; t-test P = 3.7e-8; Supplementary Table 18a; Extended Data Fig. 6a). Two hundred of these ECs were nematode-specific (i.e. not detected in any flatworm) and 93 were platyhelminth-specific (Supplementary Fig. 19a,b). The total EC-number count increased to 850, when pathway hole-filling was used to increase annotation sensitivity for the 33 tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7). Even when less reliable EC annotations from Compara families were included (Supplementary Table 18b), the most contiguous flatworm genomes had >60 fewer unique EC numbers than similarly highly contiguous nematode genomes (e.g. S. mansoni 437 and E. multilocularis 404 versus S. ratti 544, T. muris 498, C. elegans 607). This may

be partly due to real biological differences (e.g. loss of enzymes from certain platyhelminth clades), but also could be partly due to the greater number of nematodes compared with platyhelminths in this study and the rich biochemical knowledge about C. elegans and D. melanogaster, another ecdysozoan, facilitating enzyme prediction in nematodes over the more distantly related platyhelminths. As expected, the total proteome size (total amino acids for all coding sequences) was significantly correlated both with the total number of enzymes and the number of unique EC numbers (P < 1e-5 and P = 3e-5, respectively; Supplementary Fig. 19c,d). The extent to which KEGG helminth-relevant reference pathways (Supplementary Information: Methods 25) were covered in different taxonomic groups, defined as the fraction of ECs present (Supplementary Table 18e-f and Supplementary Fig. 12a), confirmed the apparently reduced metabolic potential of platyhelminths relative to nematodes: 74% of reference pathways had significantly lower coverage in platyhelminths. Amongst nematodes, many KEGG pathways (51%) appeared to have higher coverage among clade IVa parasites (Supplementary Fig. 12a). Four of the five clade IVa parasites in our analysis belong to the genus Strongyloides. This observation may partly reflect the biology of this genus (i.e. evolutionary loss of enzymes/pathways may be very rare in this genus), but also likely relates to their generally highly contiguous assemblies. In fact, when less reliable EC numbers were considered, assigned by hole-filling or inferred from other Compara family members (Supplementary Table 18a-b), the number of unique EC numbers in the tier 1 clade IVa species (with a high-quality assembly; Supplementary Information: Methods 7) S. ratti approached that of the well-studied model C. elegans (544 versus 607). In contrast, filarial nematodes (representing clade IIIc in Supplementary Fig. 
12a) had a dramatically reduced metabolic coverage compared to other nematodes, with 48% of KEGG reference pathways having significantly lower coverage, due to partial or complete loss of specific pathways (e.g. absence of the glycine cleavage system; see Amino acid metabolism below). Even B. malayi and O. volvulus, both tier 1 species, had >50 fewer unique EC numbers than S. ratti (492 and 483 versus 544, including those from hole-filling and Compara families; Supplementary Table 18a-b).
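
The pathway coverage measure used above (the fraction of a reference pathway's EC numbers that are present in a species) can be illustrated with a minimal Python sketch. The function name `pathway_coverage` and the toy EC sets are hypothetical and chosen only for illustration; they are not the actual KEGG pathway memberships or annotation results from this study.

```python
def pathway_coverage(pathway_ecs, species_ecs):
    """Return the fraction of a reference pathway's EC numbers found in one species."""
    pathway_ecs = set(pathway_ecs)
    if not pathway_ecs:
        raise ValueError("reference pathway has no EC numbers")
    # ECs annotated in the species but outside the pathway do not affect coverage.
    return len(pathway_ecs & set(species_ecs)) / len(pathway_ecs)


# Hypothetical toy reference pathway (illustrative only, not the real KEGG definition).
toy_pathway = {"4.1.3.4", "2.8.3.5", "2.3.1.9", "1.1.1.30"}

# Hypothetical EC annotations for two imaginary species.
species_a = {"4.1.3.4", "2.8.3.5", "2.3.1.9", "6.2.1.3"}
species_b = {"2.3.1.9", "6.2.1.3"}

coverage_a = pathway_coverage(toy_pathway, species_a)  # 3 of 4 pathway ECs present
coverage_b = pathway_coverage(toy_pathway, species_b)  # 1 of 4 pathway ECs present
print(coverage_a, coverage_b)
```

Per-species coverage values of this kind could then be compared between taxonomic groups with a statistical test; the test and grouping used in the study are those described in the Methods, not this sketch.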

Variability in KEGG pathway coverage In addition to their reduced KEGG pathway coverage, the pathways in platyhelminths showed significantly more variable coverage between species than seen between nematode species (Supplementary Fig. 12b, Wilcoxon test P=0.004). Amongst the platyhelminths, the flukes (trematodes) showed the highest variation in pathway coverage, whereas amongst the nematodes, clades IVb (which includes Panagrellus and the plant parasites) and IIIb had the most variation. This result did not change even when only a single species per genus was considered (data not shown), which suggests that the observed difference in variation is not entirely due to biases resulting from having multiple species from the same genus. Comparing the super-pathways using variation in pathway coverage across helminths, nucleotide metabolism pathways (only two pathways) and carbohydrate metabolism pathways showed the least variation in coverage while lipid and amino acid metabolism pathways showed much greater variation in coverage (Supplementary Fig. 12c). Some of these differences in pathway coverage variation may represent true biological differences (e.g. reflecting a wide range of lipid biochemistry), but others are likely to be because certain enzymes in these pathways or clades are more challenging to

annotate. Indeed, the high variability in pathway coverage in platyhelminths probably reflected a greater variability in assembly quality (which will have affected EC-number annotation) compared to nematodes: the interquartile ranges of two measures of assembly quality, assembly contiguity (N50/scaffold-count) and CEGMA (partial) score, were both greater in flatworms than in nematodes (470 versus 249, and 6.9 versus 2, respectively, for tier 1 species; Supplementary Table 1).
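
As a rough illustration of the assembly-contiguity comparison above, the following Python sketch computes N50, the N50/scaffold-count ratio, and an interquartile range. The function names are hypothetical, the input values are toy numbers rather than data from Supplementary Table 1, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption that may differ from the one used in the study.

```python
from statistics import quantiles


def n50(scaffold_lengths):
    """Length L such that scaffolds of length >= L contain at least half of the assembly."""
    total = sum(scaffold_lengths)
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffolds supplied")


def contiguity(scaffold_lengths):
    """Assembly contiguity expressed as N50 divided by the number of scaffolds."""
    return n50(scaffold_lengths) / len(scaffold_lengths)


def interquartile_range(values):
    """Q3 - Q1, using the default ('exclusive') method of statistics.quantiles."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


# Toy scaffold lengths for one imaginary assembly (not real data).
toy_assembly = [10, 20, 30, 40]
print(n50(toy_assembly), contiguity(toy_assembly))

# Toy per-species contiguity values for one imaginary group of species.
print(interquartile_range([1, 2, 3, 4, 5, 6, 7]))
```

A wider interquartile range in one phylum than in the other would indicate more variable assembly quality, which is the comparison made in the text.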

Coverage of KEGG modules

Metabolic potential was also compared among helminths by taking pathway topology into account, by ascertaining the presence of a metabolic module based on annotation of all indispensable enzymatic steps, as previously described181. Clustering based on ‘complete’ pathway modules identified in each species, in which all pathway steps had to be present (Supplementary Table 18c-d), confirmed the observations based on the topology-agnostic KEGG pathway coverage results described above, while providing some additional observations included in the pathway-specific results (below). Clustering species based on metabolic module presence produced a tree generally consistent with the species tree (Supplementary Fig. 19e). Consistent with the common theme, platyhelminths tended to have reduced metabolic potential relative to nematodes in terms of the number of metabolic modules complete in the worms from the two phyla (mean 21.1 and 26.8 complete modules, respectively; t-test P<9e-4; species-wise counts of complete modules shown in Supplementary Fig. 19c,d).
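
The notion of a ‘complete’ module described above (every indispensable step must be annotated) can be sketched in Python as follows. This is a simplified illustration, assuming that each step is represented as a set of alternative EC numbers, any one of which satisfies the step; the function names and the placeholder identifiers are hypothetical and do not reproduce the KEGG module definitions or the published method.

```python
def missing_steps(module_steps, species_ecs):
    """Return the 1-based indices of module steps with no annotated enzyme in the species."""
    species_ecs = set(species_ecs)
    return [
        index
        for index, alternatives in enumerate(module_steps, start=1)
        if not (set(alternatives) & species_ecs)
    ]


def module_is_complete(module_steps, species_ecs):
    """A module counts as complete only if every step has at least one annotated enzyme."""
    return not missing_steps(module_steps, species_ecs)


# Hypothetical four-step module; each step lists alternative placeholder enzyme IDs.
toy_module = [{"ec_a"}, {"ec_b1", "ec_b2"}, {"ec_c"}, {"ec_d"}]

# Hypothetical annotations for two imaginary species.
species_full = {"ec_a", "ec_b2", "ec_c", "ec_d", "ec_x"}
species_gap = {"ec_a", "ec_b1", "ec_d"}

print(module_is_complete(toy_module, species_full), missing_steps(toy_module, species_full))
print(module_is_complete(toy_module, species_gap), missing_steps(toy_module, species_gap))
```

Unlike the fraction-based pathway coverage, this criterion is all-or-nothing, so a single unannotated step (as in the second toy species) is enough to mark the module as incomplete.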

4.2 Lipid metabolism

Lipid metabolism pathways had the highest median coverage variability among helminths (Supplementary Fig. 12c) suggesting either that lipid metabolism genes are particularly hard to annotate (e.g. due to fast sequence divergence), or that there is variable sub-pathway representation due to different levels of dependence on lipids as food reserves. Lipids can be used as fuel through their degradation to fatty acids, and then to acetyl-CoA by the β-oxidation pathway. Lipids such as triglycerides are used as highly-concentrated energy reserves in nematodes from aerobic habitats, for example, during embryogenesis of Ascaris lumbricoides eggs270, and in infective larvae and free-living adults of Strongyloides ratti383.

β-oxidation

β-oxidation of fatty acids yields numerous ATP molecules when glucose intake is low. The KEGG β-oxidation module (M00087) was the only metabolic module that was complete in all nematodes yet appeared incomplete in most platyhelminths (all except the fluke C. sinensis and the cestode S. solidus), because at least one enzyme in the reaction cascade could not be confirmed by our stringent annotation. This concurred with previous discovery of β-oxidation genes in the genomes of C. sinensis and its close relative Opisthorchis349, and was consistent with biochemical evidence for the β-oxidation genes in S. solidus (although they did not appear to oxidise exogenous palmitate)384. Furthermore, we were able to identify putative β-oxidation genes in additional flatworms (Supplementary Table 19a) using EC annotations from a less stringent approach based on Compara families (Supplementary Table 18b). These included enzymes for

all four steps of the β-oxidation pathway (acyl CoA dehydrogenase, enoyl CoA hydratase, L-3-hydroxyacyl CoA dehydrogenase, and β-ketothiolase) in the fluke F. hepatica (consistent with biochemical evidence384), free-living flatworm S. mediterranea, monogenean P. xenopodis, and the cestode S. erinaceieuropaei. Candidates for some of the four steps were also found in the fluke E. caproni and cestode D. latus (Supplementary Table 19a). The β-oxidation module also appeared nearly complete (just missing one step) in (Supplementary Table 19a, Fig. 5a). Consistent with this, the Compara family containing the C. elegans carnitine palmitoyl transferase (cpt-1), which transports long chain fatty acids into the mitochondrion for β-oxidation, also contained members from C. sinensis, F. hepatica, S. mediterranea, P. xenopodis, E. caproni and D. latus. Curiously, we did not identify β-oxidation genes, or cpt-1 homologs, in several tier 1 flatworm species with highly contiguous genomes, including the trematode S. mansoni, and cyclophyllidean cestodes E. multilocularis and H. microstoma, despite biochemical evidence suggesting the related H. diminuta and Taenia crassiceps have β-oxidation enzymes67,385. In the case of S. mansoni, although putative β-oxidation genes were previously identified in the genome213, they belonged to Compara families that lack predicted β-oxidation genes from other species. However, experiments that are consistent with (but do not directly demonstrate) β-oxidation occurring during S. mansoni egg production have been reported66. Thus it is likely that S. mansoni and the cyclophyllidean tapeworms (Hymenolepis, Taenia, Echinococcus) have highly diverged β-oxidation genes; indeed, although they were absent (except for T. solium) from the Compara families containing β-oxidation genes from other species, they did have genes (in other Compara families) with some of the same Pfam domains that were in the β-oxidation genes of other species. 
Schistosomes and cyclophyllidean tapeworms live in the glucose-rich environments of the blood and small intestine, respectively, and have transporters for glucose uptake386,387, so they may have diverged to adapt to using glucose and glycogen as their main energy sources rather than lipids. Before they are degraded by β-oxidation, fatty acids are linked to coenzyme A to form acyl-CoAs (activated fatty acids) by acyl CoA synthetase (EC 6.2.1.3). We found a gene family of acyl CoA synthetases had expanded in clade IVa nematodes (family 150310; 7-10 copies per Strongyloides species versus 2 in C. elegans; Supplementary Fig. 13i). This was consistent with β-oxidation in infective larvae and free-living adults of Strongyloides ratti383. Also consistent with this was an expansion in clade IVa of a family containing C. elegans trcs-1 and human AADAC, a putative arylacetamide deacetylase and microsomal lipase, predicted to break down triglyceride to fatty acids (family 167895; 9-10 per Strongyloides species; Supplementary Fig. 13j).
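The completeness calls made above for the β-oxidation module (complete, nearly complete with one step missing, or incomplete) can be illustrated with a minimal sketch. The step names are the four β-oxidation steps listed in the text; the function names and the example annotation sets are hypothetical, and the real KEGG module definition (M00087) allows alternative enzymes per step, which this simplification ignores.

```python
# Illustrative only: classify a four-step pathway by how many steps
# have at least one annotated enzyme. Not the pipeline used in the study.
BETA_OXIDATION_STEPS = [
    "acyl CoA dehydrogenase",
    "enoyl CoA hydratase",
    "L-3-hydroxyacyl CoA dehydrogenase",
    "beta-ketothiolase",
]


def missing_steps(annotated_enzymes, steps=BETA_OXIDATION_STEPS):
    """Return the pathway steps with no annotated enzyme, in pathway order."""
    present = set(annotated_enzymes)
    return [step for step in steps if step not in present]


def classify_module(annotated_enzymes, steps=BETA_OXIDATION_STEPS):
    """Label a module as complete, nearly complete (one step missing) or incomplete."""
    n_missing = len(missing_steps(annotated_enzymes, steps))
    if n_missing == 0:
        return "complete"
    if n_missing == 1:
        return "nearly complete"
    return "incomplete"


# Hypothetical annotation sets for two unnamed species.
species_a = set(BETA_OXIDATION_STEPS)
species_b = {"acyl CoA dehydrogenase", "enoyl CoA hydratase", "beta-ketothiolase"}
print(classify_module(species_a), classify_module(species_b))
```

A stringent annotation makes a module look incomplete whenever a single step lacks a confident EC assignment, which is why a second, less stringent source of evidence (here, Compara family membership) is useful before concluding that a pathway has been lost.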

Ketone bodies

The acetyl CoA produced by β-oxidation, glycolysis or glycogen breakdown is usually fed into the citric acid cycle. However, under starvation conditions, some tissues (in humans, the liver) start to convert oxaloacetate into glucose (gluconeogenesis), and there is not enough oxaloacetate available for acetyl CoA to enter the citric acid cycle. Under these conditions, acetyl CoA is converted into acetoacetate, D-3-hydroxybutyrate and acetone, three water-soluble compounds known as ‘ketone bodies’. These water-soluble compounds are transported to key tissues (in humans, the brain) for conversion back to acetyl CoA, which can be used there as fuel in the citric acid cycle. We found that the KEGG pathway for synthesis and degradation of ketone bodies had significantly lower coverage in filaria compared with other nematodes (Supplementary Fig. 12a, Supplementary Table 18f). In particular, the hydroxymethylglutaryl CoA cleavage enzyme (EC 4.1.3.4), involved in conversion of acetyl CoA to acetoacetate, and 3-ketoacid CoA transferase (EC 2.8.3.5), involved in conversion of ketone bodies back to acetyl CoA, appeared to be missing from filarial species (but not their outgroup in clade III, Dracunculus; Supplementary Table 19b). Thus, filarial worms seemed to have lost the ability to make and use ketone bodies (Fig. 5b), perhaps because, unlike most other helminths studied here (except Trichinella and Bursaphelenchus; Supplementary Table 12), they are tissue parasites that do not have a free-living stage and are less likely to suffer periods of starvation.

4.3 The glyoxylate cycle: linking lipid and carbohydrate metabolism

Absent from most metazoans other than nematodes62, the glyoxylate cycle (M00012; Fig. 5a) bypasses multiple decarboxylation steps of the citric acid cycle to directly cleave isocitrate into glyoxylate and succinate using the enzyme isocitrate lyase (ICL; EC 4.1.3.1). In a subsequent step, malate synthase (MS; EC 2.3.3.9) is used to synthesise malate from glyoxylate and acetyl CoA (Fig. 5b). The ICL/MS enzymes are encoded by a fusion gene in nematodes that phylogenetic analyses suggest was gained by the common ancestor of nematodes, from Bacteria62. In nematodes, the glyoxylate cycle provides a way to convert lipids via acetyl CoA into glucose (via gluconeogenesis) and other carbohydrates (e.g. trehalose), and is active during embryogenesis of Ascaris lumbricoides eggs, at a time when the rates of β-oxidation and carbohydrate (glycogen and trehalose) synthesis are also greatest270,388,389. Glycogen and trehalose can potentially serve as energy reserves required for larvae to migrate in the host body, or during anhydrobiosis390. In some free-living and plant parasitic species, it has also been proposed that the glyoxylate cycle may not function to convert lipid to carbohydrate, as in Ascaris eggs, but instead converts ethanol into glucose or glycogen391. We found the corresponding KEGG module (M00012) in most nematodes (with notable absence in many clade III nematodes, including many filaria), but not platyhelminths (Supplementary Table 18c; Fig. 5a), although biochemical evidence supports its presence in Fasciola392. ICL and MS, the two enzymes unique to this cycle, were absent from filarial species (although they have previously been detected biochemically in digitata393, which appears to have some unusual biochemistry but was not among the worms studied in this work) and absent from Trichinella species (Supplementary Table 19c).
However, the glyoxylate cycle genes were found in Dracunculus, the outgroup of filaria within clade III. Both filaria and Trichinella are tissue parasites that lack free-living stages, so may not need the metabolic diversity required to survive starvation conditions, such as the ability to convert lipid to carbohydrate. Nematodes had an additional enzyme that flatworms lack: alanine-glyoxylate transaminase (EC 2.6.1.44), which converts glyoxylate and alanine to pyruvate and glycine (Supplementary Table 18a, Fig. 5b). In plants, some of the glyoxylate produced (from glycolate) during photorespiration is converted to glycine by alanine-glyoxylate transaminase, and some is oxidised by glycolate oxidase (EC 1.1.3.15) to oxalate394. Similarly, some bacteria convert glyoxylate from the glyoxylate cycle into glycine using alanine-glyoxylate transaminase395. Glycine (plus tetrahydrofolate, THF) can then be converted to 5,10-methylenetetrahydrofolate by the glycine cleavage system, to provide a one-carbon pool for cellular biosynthesis (Fig. 5b). We found alanine-glyoxylate transaminase was absent from filaria (although present in their outgroup in clade III, Dracunculus) and Trichinella (Supplementary Table 18a), which was consistent with our finding that filaria and Trichinella had lost the glyoxylate cycle, and that filaria had also lost key genes required for the glycine cleavage system (see Amino acid metabolism below). Thus, we propose that most nematodes, except filaria and Trichinella, convert some of the glyoxylate from the glyoxylate cycle to glycine, as input into the glycine cleavage system. Evidence suggests the glyoxylate cycle occurs in the mitochondria in nematodes396, while alanine-glyoxylate transaminase has been shown in vertebrates to be located in either mitochondria or peroxisomes397.

4.4 Carbohydrate metabolism

Scavenging carbohydrates from host food

Intestinal helminths such as tapeworms, some ascarids, and some hookworms such as Oesophagostomum, are thought to obtain most nutrients from intestinal contents of their hosts (Supplementary Table 12). We found that a family of α-glucosidases (EC 3.2.1.20; CAZyme class GH31), enzymes which break down starch and disaccharides (e.g. sucrose) to glucose, had independently expanded in clade IIIb (Ascaridomorpha) and clade Vc (family 69769; 51% predicted secreted; 19-31 copies per Ascaris species; 7-22 in Ancylostoma; Supplementary Fig. 13a). Excretory-secretory products of A. suum include many GH31 family glycosyl hydrolases, which include α-glucosidases398. Thus, α-glucosidases may be secreted to enable the worms to break down host food and so use it as an energy source. Among hookworms, the most family members were found in Oesophagostomum (31), but the family was also expanded in species thought to feed principally on blood (Supplementary Table 12) such as Ancylostoma species (7-22 copies), suggesting that they also obtain some energy from host food.

Malate dismutation

Many adult helminths live in low-oxygen conditions within host tissues or the host intestine and rely at least partially on anaerobic metabolism to produce ATP, including the malate dismutation pathway68,399-401. In this pathway, glycolysis converts carbohydrates to phosphoenolpyruvate (PEP) that is then carboxylated by phosphoenolpyruvate carboxykinase (PEPCK) to form oxaloacetate and subsequently reduced to malate. The malate is partly oxidised to acetate and partly reduced to succinate/propionate (Fig. 5b), which are excreted as end products68. We found that a PEPCK family had expanded in clade IIIb (Ascaridomorpha; family 146234; EC 4.1.1.32; 7-9 copies in Ascaris species versus 2 B. malayi, 2 O. volvulus; Supplementary Fig. 13c). The extra copies of PEPCK may enable ascarids to channel phosphoenolpyruvate into the malate dismutation pathway, rather than into aerobic metabolism (citric acid cycle), as malate dismutation is thought to be especially important in Ascaris402,403. Alternatively, PEPCK is required by the glyoxylate cycle, which converts acetyl CoA to carbohydrates during embryogenesis of Ascaris eggs270. We also saw a small expansion of another enzyme involved in malate dismutation, methylmalonyl CoA epimerase (EC 5.1.99.1) in clade IIIb (family 586945; 2-4 genes per species in this clade but single-copy elsewhere; Supplementary Fig. 13e). In addition, an intracellular cobalamin (vitamin B12) trafficking chaperone gene family, containing C. elegans cblc-1 and human MMACHC404-406, had expanded in clade IIIb (family 646695; 4 A. suum versus 1 C. elegans; Supplementary Fig. 13d). This may allow increased mitochondrial uptake of vitamin B12, which is required for malate dismutation in mitochondria400, and absorbed in large amounts by some helminths68,407. Surprisingly, we also found a CobQ/CbiP cobyric acid synthase family in clade IIIb (e.g. A. suum ASU_02377; also in the outgroups Amphimedon (sponge) and Capitella (annelid); family 1160166; EC 6.3.5.10), an amidating enzyme involved in cobalamin synthesis that converts adenosylcobyrinic acid to adenosylcobyric acid408. This family may have been gained by horizontal transfer from bacteria, since it had CbiA (Pfam PF01656) and GATase_3 (PF07685) domains, a combination almost exclusively found in bacteria and archaea. The top non-self BLASTP hit of the A. suum member of the family (ASU_02377) in the NCBI protein database was CobQ from the bacterium Bacteroides eggerthii (accession WP_004291065.1; 67% identity, E-value 0.0), and the top twelve non-self BLASTP hits were all to Bacteroides species proteins. Phylogenetic analysis confirmed that these ascarid proteins form a cluster within a clade of Bacteroidales sequences, and probably represent a genuine lateral gene transfer from this group (Supplementary Fig. 14a). The phylogeny confirmed that the other invertebrate sequences (from Amphimedon and Capitella) were related to a different bacterial group (gammaproteobacteria) and seemed likely to be contaminants rather than genuine lateral gene transfers. That is, the Amphimedon gene was present in the v1.0 assembly for that species used in our Compara database, but seemed to be deleted in v2.0.
Both this gene and the Capitella sequence were on small contigs (~10 kb and ~7 kb, respectively) with only single-exon genes or genes annotated with only very short introns, and with strong BLAST hits to the genomes of a group of marine bacteria that are frequently found associated with marine invertebrates. Bacteroides are common in the mammalian intestine where Ascaris is found409. This nematode enzyme probably has a different role in clade IIIb than in bacteria, perhaps in converting scavenged vitamin B12 and B12-like compounds to forms such as adenosylcobalamin that are useful for malate dismutation. Indeed, in some archaea, it appears that adenosylcobyric acid is the entry point for adenosylcobalamin salvage410, although this requires additional enzymes such as CbiB that we did not find in Ascaris. The A. suum member of the family (ASU_02377) was previously found to be overexpressed in the Ascaris intestine (as gene model GS_23025411), which would be consistent with a role in salvage from host tissue or host intestinal contents.
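One of the lines of evidence used above for lateral gene transfer is that the top twelve non-self BLASTP hits all came from a single bacterial genus. A minimal sketch of that check follows, assuming hits have already been parsed into records with a genus label; the `Hit` record, the function name and the example identifiers are hypothetical and are not taken from the study's pipeline.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed BLASTP hit (hypothetical record layout)."""
    subject: str
    genus: str
    evalue: float


def top_hits_all_in_genus(hits, query_genus, target_genus, n=12):
    """True if the n best non-self hits (lowest E-value first) are all from target_genus."""
    non_self = [h for h in hits if h.genus != query_genus]
    top = sorted(non_self, key=lambda h: h.evalue)[:n]
    return bool(top) and all(h.genus == target_genus for h in top)


# Made-up hit list: one self hit, twelve Bacteroides hits, one weaker hit elsewhere.
hits = [Hit("self_1", "Ascaris", 0.0)]
hits += [Hit(f"bact_{i}", "Bacteroides", 1e-150 * (i + 1)) for i in range(12)]
hits.append(Hit("other_1", "Escherichia", 1e-40))
print(top_hits_all_in_genus(hits, "Ascaris", "Bacteroides"))
```

A check of this kind is only suggestive on its own, since it cannot separate genuine transfer from contamination; in the text, the distinction rests on the phylogeny and on the genomic context of the contigs.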

Duplications (defined here as EC numbers that are represented by ≥2 genes in ≥2 of the 81 nematode and platyhelminth species) were identified in the ‘glyoxylate and dicarboxylate’ KEGG pathway (which overlaps with malate dismutation; Supplementary Table 20j), with examples of paralogs for 92% of its enzymes. For example, there were duplications in the malate dismutation enzyme propionyl-CoA carboxylase (EC 6.4.1.3), which converts D-methyl-malonyl CoA to propionyl CoA, in many nematodes and in trematodes (Supplementary Table 20j).
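The duplication criterion stated above (an EC number represented by ≥2 genes in ≥2 species) can be written down directly. The sketch below assumes the annotation is available as (species, gene, EC number) records; the function name, the record layout and the example gene identifiers are hypothetical.

```python
from collections import defaultdict


def duplicated_ecs(records, min_genes=2, min_species=2):
    """Return EC numbers with >= min_genes distinct genes in >= min_species species.

    records: iterable of (species, gene_id, ec_number) tuples.
    """
    genes_by_ec_species = defaultdict(lambda: defaultdict(set))
    for species, gene_id, ec in records:
        genes_by_ec_species[ec][species].add(gene_id)

    duplicated = []
    for ec, per_species in genes_by_ec_species.items():
        n_species = sum(1 for genes in per_species.values() if len(genes) >= min_genes)
        if n_species >= min_species:
            duplicated.append(ec)
    return sorted(duplicated)


# Made-up records: EC 6.4.1.3 has two genes in each of two species,
# EC 2.1.2.1 has two genes in only one species.
records = [
    ("species_A", "geneA1", "6.4.1.3"), ("species_A", "geneA2", "6.4.1.3"),
    ("species_B", "geneB1", "6.4.1.3"), ("species_B", "geneB2", "6.4.1.3"),
    ("species_A", "geneA3", "2.1.2.1"), ("species_A", "geneA4", "2.1.2.1"),
    ("species_B", "geneB3", "2.1.2.1"),
]
print(duplicated_ecs(records))
```

Genes are stored in sets so that a gene listed twice for the same EC number is counted once.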

Propionate shunt

As mentioned above, acetate (via acetyl CoA) and propionate are common end products of malate dismutation, but some helminths may re-metabolise a portion of these end products under certain conditions. In clade Va (strongylid-other), we saw an expansion of a family containing C. elegans dehydrogenase alh-8, involved in a cobalamin-independent ‘propionate shunt’ pathway for converting propionate to acetyl CoA69 (family 476273; EC 1.2.1.18/1.2.1.27; 5-8 copies per Haemonchus species; 13 Teladorsagia circumcincta; Supplementary Fig. 13f, Fig. 5b). One possibility is that Haemonchus and close relatives may use this pathway to convert propionate from malate dismutation, or possibly residual unabsorbed propionate from bacterial fermentation in the ruminant host70, to acetyl CoA, which then can be converted to acetate (thereby producing additional ATP)412. The C. elegans propionate shunt genes are predicted to be localised to the mitochondria69, where malate dismutation occurs. The propionate shunt may be more important in nematodes than flatworms, since the ‘beta-alanine’ KEGG pathway (which overlaps with the propionate shunt) included two enzymes that were found in nematodes but not flatworms: 4-aminobutyrate aminotransferase (EC 2.6.1.19, which interconverts β-alanine and malonic semialdehyde, a propionate shunt intermediate) and the propionate shunt enzyme acyl-CoA dehydrogenase (EC 1.3.8.7, corresponding to C. elegans acdh-169; Supplementary Table 18a; Supplementary Table 20f).

The lactate dehydrogenase pathway

The lactate dehydrogenase (LDH) pathway is an alternative pathway for the anaerobic production of ATP from carbohydrates401. The LDH pathway produces only two molecules of ATP per molecule of glucose, compared with three for the malate dismutation pathway, which may be why Ascaris prefers the malate dismutation pathway to survive in the fluctuating carbohydrate supply of the host intestine270. In contrast, some adult flatworms such as Schistosoma spp. (which live in glucose-rich blood), C. sinensis, E. granulosus, and Taenia spp. are known to use the LDH pathway401. In agreement with this, the Compara family containing LDH (EC 1.1.1.27) was present in many copies in cestodes and schistosomatids (e.g. 6 E. multilocularis, 3 S. mansoni), but in few copies in most nematodes (e.g. 1 A. suum; family 136770; Supplementary Fig. 13g). An exception was the Ancylostoma clade, which had many LDH genes (7 A. duodenale, 8 A. caninum, 6 A. ceylanicum). The LDH pathway may therefore predominate in Ancylostoma, which like schistosomes feed on glucose-rich blood.

The GABA shunt

In clade IIIb (Ascaridomorpha), we saw a small expansion of a glutamate dehydrogenase family (family 114732; EC 1.4.1.2, EC 1.4.1.3 or EC 1.4.1.4; 4-5 per Ascaris species; Supplementary Fig. 13h). Glutamate dehydrogenase interconverts glutamate and α-ketoglutarate, and in this case may be involved in conversion of α-ketoglutarate to succinate via glutamate and gamma-aminobutyric acid (GABA), known as a GABA shunt (Fig. 5b). Although the other enzymes involved in the GABA shunt (GABA-α-ketoglutarate aminotransferase, EC 2.6.1.19; glutamate decarboxylase, EC 4.1.1.15; and succinate-semialdehyde dehydrogenase, EC 1.2.1.24) appeared to be present in some nematodes, they were missing from flatworms (in Clonorchis a gene was annotated as glutamate decarboxylase but had a top BLAST hit to bacteria, indicating possible sequence contamination; Supplementary Table 19d). Based on the existence of glutamate dehydrogenase, glutamate decarboxylase and aminotransferase, in Ascaris, and its of GABA under anaerobiosis, it was previously suggested that it may have a GABA shunt (see review413). In humans, the GABA shunt (also known as ‘4-aminobutyrate bypass’) may regulate redox homeostasis by controlling intracellular levels of α-ketoglutarate and fumarate414. Indeed, glutamate dehydrogenase has been reported in the tapeworm and in blood feeding stages of Haemonchus contortus, in which it is proposed to maintain the NAD/NADH redox balance in response to changes resulting from anaerobic metabolism (malate dismutation and LDH pathways)256,415,416. It is curious that we found glutamate dehydrogenase but not the other GABA shunt enzymes in flatworms such as Hymenolepis (Supplementary Table 19d). Furthermore, while glutamate dehydrogenase is present in the mitochondria of , the Hymenolepis and Haemonchus proteins appear to be cytoplasmic415,416. An alternative role for glutamate dehydrogenase is to convert glutamate to α-ketoglutarate (i.e. the reverse of the reaction that occurs in the GABA shunt) when amino acids are needed as precursors for glucose synthesis (gluconeogenesis), or for energy (the α-ketoglutarate can be used in the citric acid cycle)394, and this may be its main role in flatworms.

Glycogen synthesis

In trematodes, we identified an expansion of a CAZyme class GT8 glycosyltransferase family, especially in Clonorchis (family 476359; 14 Clonorchis, 4 S. mansoni; Supplementary Fig. 13b). Expansion of GT8 enzymes (involved in lipopolysaccharide and glycogen biosynthesis) in trematodes may be linked to the reliance of their cercariae (free-living larvae) on energy stored within glycogen granules349,382.

4.5 Amino acid metabolism

Glycine cleavage system

The primary difference in ‘glycine, serine and threonine metabolism’ is the absence of the glycine cleavage system (GCS) in cestodes. The GCS uses four proteins (T-, P-, L- and H-proteins) to catabolise glycine when present in high concentrations, to form a one-carbon donor (5,10-methylenetetrahydrofolate) for various biosynthesis reactions within the cell, including purine, thymidylate and methionine synthesis417. In at least some bacteria, the GCS can be run in reverse to synthesise glycine. Though present in many helminths, the GCS appears to be missing from cestodes and filarial worms, due to the absence of detectable T- and P-proteins (EC 2.1.2.10 and 1.4.4.2, respectively) (Supplementary Table 19e). In contrast, the T- and P-proteins appear to be present in Dracunculus, sister group to filaria in clade IIIc, as they have members in the same Compara families as C. elegans T-protein (gcst-1) and P-protein (gldc-1). EC 1.8.1.4 (L-protein) is important for many other processes and seems to be ubiquitous. H-protein has been predicted to be present in the cestodes E. multilocularis and E. granulosus219,418 and the filarial nematode B. malayi372. Indeed we find members from the latter species in the same Compara family as C. elegans H-protein (gcsh-1). In cestodes and filarial nematodes, the H-protein could therefore represent either a vestigial gene from the lost pathway or perform an alternative and uncharacterised function. Indeed, ‘partial glycine cleavage systems’ have been found in the protozoan parasites Trichomonas vaginalis, which is anaerobic (just has L-protein and H-protein419), and Spironucleus salmonicida (has H-protein420), which is adapted to micro-aerobic environments. If cestodes and filaria cannot make 5,10-methylenetetrahydrofolate (to create a one-carbon pool) from glycine (plus tetrahydrofolate, THF) using the glycine cleavage system, they may instead catabolise serine (plus THF) to form glycine and 5,10-methylenetetrahydrofolate using the enzyme serine hydroxymethyltransferase (EC 2.1.2.1; Supplementary Table 18a).

Duplications (defined here as EC numbers that are represented by ≥2 genes in ≥2 of the 81 helminth species) were identified in the glycine-serine-threonine (GST) KEGG pathway, with examples of paralogs for 71% of its enzymes (Supplementary Table 20b). For example, there are lineage-specific duplications in the enzymes that convert serine into glycine (serine hydroxymethyltransferase, EC 2.1.2.1; duplicated in trematodes and clade V), and produce serine from pyruvate (EC 4.3.1.19; duplicated in clades IV and V).

Amino acid auxotrophies

Parasitic helminths have a wide host range and their nutritional requirements are expected to vary. Numerous biochemical and mineral components are essential nutrients, hence we examined the potential impact of the differences in annotated enzymes on biosynthetic capacity (using Pathway Tools, see Methods) by predicting amino acid and vitamin auxotrophies for all tier 1 species. Helminths are auxotrophic for eight of the nine amino acids essential to humans and most animals (Phe, His, Lys, Leu, Met, Thr, Trp, and Val) (Supplementary Table 18g). Among these, Pathway Tools predicted valine auxotrophy in all helminths except C. sinensis; however closer examination, in the absence of experimental data, indicated that this exception might be due to gene model errors for this genome. Auxotrophy for isoleucine, another amino acid essential for most animals421, could not be confirmed by Pathway Tools for many nematodes. However, isoleucine auxotrophy is highly likely, given that helminths are auxotrophic for leucine and that the pathways for producing isoleucine and leucine are similar, except for the threonine to 2-oxobutanoate step (EC 4.3.1.19). Besides these nine amino acids, arginine is also an essential amino acid for all species investigated here (Supplementary Table 18g). This has been recognised previously in nematodes422 but not in platyhelminths, and is due to loss of the urea cycle enzymes necessary for converting the precursor to arginine (ornithine carbamoyltransferase, argininosuccinate synthase, argininosuccinate lyase; Supplementary Table 18a). In contrast to platyhelminths and nematodes, humans and other vertebrates have retained the arginine synthesis pathway.

4.6 Nucleotide metabolism

Pyrimidine synthesis

The KEGG pathway for ‘alanine, aspartate and glutamate metabolism’ lacks both carbamoyl phosphate synthetase (EC 6.3.5.5; CPS-II) and aspartate transcarbamylase (EC 2.1.3.2; ATCase), the first two enzymes in pyrimidine synthesis, across all cestodes presented here. We do however note that ATCase has been reported in the cestodes Moniezia benedeni and Hymenolepis diminuta423, and CPS-II has been reported in H. diminuta, albeit with low activity424. Furthermore, indirect evidence from in vitro studies suggests that M. corti larvae can synthesise pyrimidines de novo425. In addition, some cestode genes have the Pfam domains associated with CPS-II (PF02786, PF02787), but not the domain usually found in ATCase (PF02729). These two enzymes are present as a fusion gene in C. elegans (pyr-1) and S. mansoni (Smp_186670). Their absence suggests that many cestodes might be lacking pyrimidine de novo biosynthesis. In agreement with this, previous analysis of the genome led to the suggestion that it cannot synthesise pyrimidines418. Thus, some cestodes may instead depend on salvaging pyrimidines from their hosts, which may occur even in helminth species that can synthesise pyrimidines391. It has previously been reported that several nematodes have also lost one or more enzymes in the pyrimidine synthesis pathway (several filaria and T. spiralis426). Indeed, both cestodes and filaria are missing KEGG module M00051, which is part of pyrimidine metabolism (Fig. 5a). We did not find the enzymes ATCase or CPS-II in any of the filaria (except for predicted CPS-II genes in B. timori and D. immitis, which had top BLASTP hits in UniProt to bacterial or proteins, so are likely to represent contaminant or genes, respectively), although we did find them in Dracunculus, their outgroup in clade III (Supplementary Table 19f). It has previously been shown that the Wolbachia found in D. immitis has pyrimidine synthesis genes, and suggested that many filaria depend on their for pyrimidine synthesis63,427.

Purine synthesis

Purine synthesis starts with the activation of ribose phosphate to phosphoribosyl pyrophosphate (PRPP), a step that is present in all nematodes and platyhelminths (KEGG module M00005, Supplementary Table 18d). Despite some indirect biochemical evidence to the contrary in Brugia pahangi and Dirofilaria immitis428, the synthesis of IMP from PRPP (represented by module M00048) is completely absent from parasitic platyhelminths, filaria and most clade IV nematodes (Fig. 5a). This agrees with previous analyses of genomes of filaria such as B. malayi, D. immitis and L. loa63,372,426, and of flatworms such as Echinococcus granulosus418 and Schistosoma japonicum429. Many platyhelminths and nematodes clearly need to salvage purines, consistent with previous reports65. It has been suggested that in some filarial species such as B. malayi and D. immitis, de novo purine biosynthesis is provided by their Wolbachia endosymbionts63,372. Although complete pathways could be detected for several species including Trichuris spp., C. elegans, Dracunculus medinensis and Haemonchus contortus, for most of the other nematodes of clades I, IIIb and V, a near-complete pathway could be detected, lacking the terminal two steps (AICAR transformylase, EC 2.1.2.3, and IMP cyclohydrolase, EC 3.5.4.10; Supplementary Table 19g) that are needed to synthesise IMP from a precursor ribonucleotide. This suggests that these latter enzymes are particularly difficult to identify in nematodes, have been lost from several species, or that an alternative mechanism to produce IMP is used. In two cases (Elaeophora and Rhabditophanes), these terminal steps were the only parts of the pathway to be detected, although the Elaeophora gene encoding these steps (EEL_0000649601) appears to be a contaminant, as its top UniProt BLASTP hit is to Bacteria (UniProt J2VZF7 from Phyllobacterium, E-value 0.0, 89% identity).
Fasciola hepatica appeared as an outlier in our analyses, with a complete pathway apparently present. However, inspection of individual hits revealed that the pathway is provided by the recently reported Neorickettsia endosymbiont47. Interconversions between IMP and other purine nucleosides (represented by KEGG modules M00049 and M00050) are ubiquitous in nematodes and platyhelminths: they were initially detected in all tier 1 species except for F. hepatica, which lacked EC 4.3.2.2 (adenylosuccinase), and H. contortus, which appeared to lack EC 6.3.5.2 (GMP synthase). However, a reasonable Fasciola candidate (D915_00696) was obtained by looking at Compara families, and a Haemonchus candidate was found from a BLASTP search of the published Australian strain256 using H. placei GMP synthase as a query sequence (Supplementary Table 18a-b).

4.7 Metabolism of cofactors and vitamins

Haem metabolism

The only KEGG pathway where platyhelminths have significantly higher coverage than nematodes is ‘porphyrin and chlorophyll metabolism’ (Supplementary Fig. 12a). This is primarily due to the presence of multiple enzymes of the haem biosynthesis pathway in platyhelminths, while all but the terminal ferrochelatase (EC 4.99.1.1) step are missing from most nematodes (Supplementary Table 19h; Supplementary Table 20i), consistent with their inability to synthesise haem16,430,431. Platyhelminths, including S. mansoni (despite conflicting biochemical data16), appear to have a complete haem biosynthesis pathway, although some parts of the pathway are quite diverged. For example, our EC annotation (Supplementary Table 18a) was missing URO synthase (EC 4.2.1.75) and protoporphyrinogenase (ECs 1.3.3.4/1.3.5.3), but we were able to find reasonable candidates for these genes in platyhelminths by looking at Compara families and BLAST searches. Specifically, the Compara family containing human URO synthase (UROS or HemD, ENSG00000188690) is family 902846, which includes the S. mansoni gene Smp_079840, a good candidate for S. mansoni URO synthase. In addition, using human protoporphyrinogenase (PPOX) in a BLASTP search against WormBase ParaSite135, we found a hit to S. mansoni Smp_068250 (E-value 1e-6, 38% identity), which is the only S. mansoni gene that has a protoporphyrinogenase domain (IPR004572; E-value 3e-44), so is a good candidate for S. mansoni protoporphyrinogenase. In support of the presence of haem biosynthesis in platyhelminths, transcripts for haem biosynthesis enzymes were previously reported in schistosome EST datasets432, while haem biosynthesis has been reported in the free-living platyhelminth Schmidtea mediterranea433.

Vitamin metabolism

Retinol (vitamin A) can be obtained from the diet either by hydrolysing dietary retinyl esters (e.g. by using EC 3.1.1.3, present in most nematodes and many flatworms; Supplementary Table 18a) or splitting carotenoids (e.g. β-carotene). The latter route uses beta-carotene dioxygenase (EC 1.13.11.63, formerly EC 1.14.99.36), which is only present in some nematodes and is absent from flatworms (Supplementary Table 18a-b); and retinol dehydrogenase (EC 1.1.1.300), which appears to be present in most nematodes and platyhelminths, since they have members belonging to a family (id. 133629) containing the human retinol dehydrogenase gene (ENSG00000139988 = RDH12; Supplementary Table 18b). Further, like humans, most of them can salvage thiamine (vitamin B1)434 and pyridoxal (vitamin B6)435, two important cofactors (Supplementary Table 18g). Folate derivatives (e.g. tetrahydrofolate) are important co-factors facilitating one-carbon transfer in key biosynthetic processes (synthesis of purines, pyrimidines, methionine) and appear to be procured from the environment436 as folic acid, as is the case for humans and other animals437. Folate polyglutamylation, a key step for generating the preferred form of the cofactor utilised by enzymes (catalysed by dihydrofolate synthetase, EC 6.3.2.12, and folylpolyglutamate synthetase, EC 6.3.2.17), is active in all nematodes (Supplementary Table 18g); its apparent absence in platyhelminths is likely due to the challenge of annotating divergent enzymes436, since we do find putative flatworm folylpolyglutamate synthetase genes (EC 6.3.2.17) by analysing Compara families (Supplementary Table 18b). In general, spiruromorphs (including filaria) and cestodes seem to lack many pathways for post-processing of folates (Supplementary Table 18g).

5. Identifying New of Anthelmintic Drug Targets and Drugs

Authors: Avril Coghlan, James Cotton, Andrew R. Leach, Prudence Mutowo and Noel O’Boyle

5.1 Summary of approach

There is a pressing need for new anthelmintic drugs, since existing ones suffer from low efficacy, serious side-effects or rising resistance in parasite populations8,438,439. The availability of genome data provides new opportunities to identify compounds structurally different to existing anthelmintics, which may provide new treatments. This can be done by repurposing existing drugs and/or by identifying biologically active compounds from other areas of drug discovery as starting points for hit/lead discovery. We therefore searched for the closest sequence (BLASTP) match of each helminth protein in the ChEMBL database197, and if a significant match was found, we retrieved compounds with bioactivities for that ChEMBL target (Extended Data Fig. 7). The helminth protein was then assigned a score based on its attractiveness as a potential new drug target, taking into account factors such as the strength of the BLAST match, whether the ChEMBL target lacked human homologs, and whether C. elegans or D. melanogaster homologs have lethal or sterile phenotypes when disrupted. This scoring system was devised using a training set of known/likely targets of existing anthelmintic drugs used for humans, to ensure that helminth proteins having high scores are good candidates for new anthelmintic drug targets. Our goal was to identify candidates for new classes of anthelmintic compounds, rather than molecules structurally similar to known anthelmintic compounds. Molecules associated with known targets are most attractive for drug development if, for example, the target has a lethal knockout phenotype. Thus, we focussed on compounds active against the highest-scoring 15% of targets. We then filtered the target-compound pairs to retain those that were the most promising by virtue of having a high pChEMBL score (reflecting high potency/affinity for the ChEMBL target) or appearing in a PDBe structure with the ChEMBL target.
This gave us a set of top candidate molecules, which includes some approved drugs (mostly not anthelmintics). These could be considered for repurposing as novel anthelmintics, which would save considerable effort and expense, a key factor in drug discovery for neglected diseases. The other compounds in the set are typically from the medicinal chemistry literature and so represent molecules to test for anthelmintic activity.

ChEMBL is a manually curated database of molecules with biological or pharmacological activity. To establish the suitability of ChEMBL as a source of potential new anthelmintic agents, we checked for the presence of existing anthelmintic compounds. Of 261 known anthelmintic compounds (current veterinary or medical anthelmintics, plant-nematicidal agents, or medicinal chemistry compounds), 208 could be identified, including 23 of the 24 drugs used for humans with WHO ATC code P02 (WHO anthelmintics) and 33/37 veterinary drugs with the WHO ATCvet classification (Supplementary Table 21a, columns A-B). A list of known/likely targets for the 24 WHO compounds was collated by searching the DrugBank database198 and by additional literature searches. At least one plausible single-protein target was identified in ChEMBL for 19/24 WHO anthelmintics, and for an additional nine known anthelmintic compounds not in the WHO set (Supplementary Table 21b). Other WHO compounds lack targets in ChEMBL either because their target proteins are unknown (e.g. , , ); or their known/likely targets lack sequence matches in ChEMBL (e.g. oxamniquine); or their targets are complexes and not single proteins in ChEMBL (e.g. oxantel). The suitability of ChEMBL for searching is underlined by the fact that ChEMBL contains homologs for the majority of known/likely single-protein targets of existing anthelmintic drugs.

5.2 Ranking helminth proteins as likely drug targets

To identify helminth proteins likely to be good potential drug targets, we ran BLASTP to compare all 528,469 helminth proteins from the 33 tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7) with known targets from the ChEMBL 21 database, which contains 1,592,191 compounds and 11,019 targets. We focussed on the 6,261 single-protein ChEMBL targets, which may be easier to develop as drug targets than protein complexes. We found 106,278 helminth genes with top BLASTP hits (E≤1e-10) to 3,994 individual single-protein ChEMBL targets. We then assigned a target score to each of the 106,278 helminth proteins in an attempt to quantify their likely quality as targets for known compounds (Extended Data Fig. 7). The major contributors to this score were the quality (E-value and target coverage) of the BLASTP match between the helminth and ChEMBL proteins; whether the ChEMBL target had a close human BLAST match (since targeting a protein that lacks a human homolog is less likely to cause undesirable side-effects); and whether the helminth gene has a C. elegans or D. melanogaster homolog with a relevant phenotype (for example, a lethal or sterile phenotype) (Extended Data Fig. 7). Other minor contributors included whether the helminth protein is a predicted chokepoint enzyme; whether it is expressed in key life cycle stages (for example, adult); whether it has homologs in most members of a major helminth clade (for example, in most nematodes); whether it lacked within-species paralogs; whether it belonged to a Compara family with a highly conserved alignment; whether the matching ChEMBL protein had a structure in the PDBe205; and whether it was from a non-chordate animal. The list of 106,278 helminth genes contained many related genes (for example, 220 tubulin genes). To remove this redundancy, we first took just the highest-scoring helminth match to each ChEMBL protein, and then further filtered to take just the highest-scoring helminth member from each Compara family, leaving 1,925 helminth proteins (matching 1,925 ChEMBL targets).
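The two-step redundancy filter described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the `Hit` record, its field names and the example identifiers and scores are hypothetical, and the target score is assumed to have been computed already.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    # Hypothetical record: one helminth protein with its top ChEMBL hit.
    helminth_protein: str
    chembl_target: str
    compara_family: str
    score: float  # target score, assumed to be precomputed


def remove_redundancy(hits: Iterable[Hit]) -> List[Hit]:
    """Keep the best helminth protein per ChEMBL target, then per Compara family."""
    # Step 1: highest-scoring helminth match to each ChEMBL protein.
    best_per_target = {}
    for hit in hits:
        current = best_per_target.get(hit.chembl_target)
        if current is None or hit.score > current.score:
            best_per_target[hit.chembl_target] = hit

    # Step 2: among the survivors, highest-scoring member of each Compara family.
    best_per_family = {}
    for hit in best_per_target.values():
        current = best_per_family.get(hit.compara_family)
        if current is None or hit.score > current.score:
            best_per_family[hit.compara_family] = hit

    # Return the non-redundant list ranked from best to worst.
    return sorted(best_per_family.values(), key=lambda h: h.score, reverse=True)


example_hits = [
    Hit("tubulin_1", "TARGET_A", "family_1", 30.0),
    Hit("tubulin_2", "TARGET_A", "family_1", 25.0),
    Hit("tubulin_3", "TARGET_B", "family_1", 28.0),
    Hit("kinase_1", "TARGET_C", "family_2", 40.0),
]
ranked = remove_redundancy(example_hits)
```

In this toy input the three tubulin-like proteins collapse to a single representative, because they share one Compara family even though they match two different ChEMBL targets.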

5.3 Choice of weightings in the scoring system for targets

When developing our scoring system for helminth proteins, we initially assigned weights to contributors based on prior knowledge about favourable properties (e.g. a strong BLASTP match to a ChEMBL target and a predicted lethal knockout phenotype were assigned large weightings). We then checked that the known/likely targets for most WHO anthelmintics (listed in Supplementary Table 21b) were in the top 15% (top 289/1925) of this ranked list, and manually adjusted the weightings until this was the case. In this way, the known/likely targets of existing anthelmintic drugs used for humans were used as a ‘training set’ to ensure that high scores reflected an increased likelihood of being good candidate targets for new anthelmintics. Using our final set of weightings (see Extended Data Fig. 7), the top 15% of our target list included at least one known/likely target for 17/19 WHO anthelmintics with targets in ChEMBL (, , , , , , , pyrantel, , bephenium, , , , , , , ). The two exceptions are piperazine and . Single-protein targets for these compounds are present in ChEMBL and have (low-ranked) hits in our analysis, but these may not be the key targets in vivo (see Supplementary Table 21b and refs 440,441). Finally, we checked whether 13 potential drug targets, identified in at least two previous helminth genome studies53,172,183,325, tended to score well in our scoring system (Supplementary Table 21c). Eleven were in the list of helminth targets, eight of which were in the top-scoring 15% of targets, and two in the top-scoring 25%. Our scoring approach thus captured known targets of most clinical anthelmintics and was able to identify candidates from previous in silico studies, giving us confidence that other helminth proteins with high scores are good candidates for new anthelmintic drug targets.
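The training-set check described above amounts to asking, for each drug, whether at least one of its known/likely targets falls in the top 15% of the ranked list. The sketch below is illustrative only: the function names and toy identifiers are hypothetical, and rounding the cutoff up is an assumption (it reproduces the 289 of 1,925 quoted in the text).

```python
import math
from typing import Dict, List, Sequence


def top_fraction_cutoff(n_targets: int, fraction: float = 0.15) -> int:
    """Number of targets in the top `fraction` of a ranked list (rounded up)."""
    return math.ceil(n_targets * fraction)


def drugs_recovered(
    ranked_targets: Sequence[str],
    drug_to_targets: Dict[str, List[str]],
    fraction: float = 0.15,
) -> List[str]:
    """Drugs with at least one known/likely target in the top fraction."""
    cutoff = top_fraction_cutoff(len(ranked_targets), fraction)
    top = set(ranked_targets[:cutoff])
    return sorted(
        drug for drug, targets in drug_to_targets.items() if top & set(targets)
    )


# Toy example: ten ranked targets, so the top 15% is the first two.
toy_ranking = [f"t{i}" for i in range(1, 11)]
toy_drugs = {"drug_a": ["t2"], "drug_b": ["t9"], "drug_c": ["t5", "t1"]}
recovered = drugs_recovered(toy_ranking, toy_drugs)
```

Adjusting the weights and repeating this check until most training drugs are recovered corresponds to the manual tuning loop described in the text.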

5.4 Identifying potential new anthelmintic targets, and drugs

From the BLASTP searches against ChEMBL single-protein targets using the helminth proteomes as queries (see above), we identified 827,889 small molecules with recorded biological activities. Concentrating on the top 15% (289 of 1925) highest-scoring helminth proteins (Supplementary Table 21d column A), we took compounds associated with their top BLASTP hits in ChEMBL and compounds associated with the top BLAST hits of other members of the same Compara family. Of our 289 highest-scoring helminth targets, 286 targets had such compounds (292,499 unique compounds). To increase the chance that target-compound associations in ChEMBL correspond to bioactivity in vivo, we used two filters to select the target-compound pairs with the most convincing evidence: (i) co-appearing in a PDBe205 structure with the ChEMBL target; or (ii) having a high pChEMBL score (median >5 for phase III/IV drugs, or >7 for medicinal chemistry compounds), reflecting high potency/affinity for the ChEMBL target. Since existing phase III/IV drugs are more attractive for developing (repurposing) as novel anthelmintics, we used a less stringent pChEMBL filter for phase III/IV drugs (median pChEMBL >5) than for medicinal chemistry compounds (median >7). This gave us a list of 52,154 ‘top candidate molecules’, which includes 795 approved drugs (phase IV) and 164 phase III drugs that have 102 targets. These drugs, which could be considered for repurposing as novel anthelmintics, were classified into 730 chemical classes (see Supplementary Methods for details of identifying classes). The remaining 51,195 molecules, which are medicinal chemistry compounds (rather than drugs), span 13,191 additional chemical classes and represent a screening pool that the research community could test for anthelmintic activity.
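The two evidence filters can be expressed as a single predicate per target-compound pair. The following Python sketch is a hedged illustration: the function and argument names are hypothetical, the thresholds are the ones quoted above, and treating a pair with no pChEMBL values and no PDBe structure as failing is an assumption.

```python
from statistics import median
from typing import Sequence

DRUG_THRESHOLD = 5.0     # phase III/IV drugs: median pChEMBL > 5
MEDCHEM_THRESHOLD = 7.0  # medicinal chemistry compounds: median pChEMBL > 7


def passes_evidence_filter(
    pchembl_values: Sequence[float],
    max_phase: int,
    in_pdbe_structure: bool,
) -> bool:
    """Return True if a target-compound pair passes filter (i) or filter (ii)."""
    # Filter (i): compound co-appears with the target in a PDBe structure.
    if in_pdbe_structure:
        return True
    # Assumption: with no structure and no activity values, the pair fails.
    if not pchembl_values:
        return False
    # Filter (ii): median pChEMBL above a phase-dependent threshold.
    threshold = DRUG_THRESHOLD if max_phase >= 3 else MEDCHEM_THRESHOLD
    return median(pchembl_values) > threshold
```

The same potency values can therefore pass for a phase III/IV drug and fail for a medicinal chemistry compound, which reflects the deliberately less stringent treatment of drugs that could be repurposed. The counts in the text are consistent with this split: 795 + 164 = 959 drugs, and 52,154 - 959 = 51,195 remaining compounds.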

5.5 Diverse screening set

In order to create a diverse set of compounds for screening in whole-worm assays, we first took the 6432 (12.3%) of the 52,154 compounds that are listed as commercially available in ZINC15211 (see Methods), which were clustered into 4543 groups. Then, to select one compound from each cluster, we devised a ranking score. Contributors to this score included the QED score of the compound (a measure of oral drug-likeness208); the phase of approval of the compound as a drug (e.g. phase IV or III approval); whether the compound (if an approved drug) could be administered orally or topically; and whether it had serious side effects and/or predicted toxicology targets (see Extended Data Fig. 7; Supplementary Methods). The phase of drug approval (phase 0-IV) was given the highest weight, assuming that drugs with a higher phase of approval will be quicker and cheaper to develop as novel anthelmintic drugs. For each target, the highest-scoring compound from each chemical class was taken, giving a ‘diverse screening set’ of 5046 compounds, including 817 drugs (with phase IV/III approval) in 661 chemical classes (Supplementary Table 21d columns V, W), and 4229 medicinal chemistry compounds in 3936 chemical classes (Supplementary Table 21d column X), excluding chemical classes containing known anthelmintic compounds (from Supplementary Table 21a). Note that some chemical classes are still represented by multiple compounds in this screening set: because we collapsed by chemical class separately for each target, the final set can contain compounds of the same chemical class that were selected for different targets.
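The per-target collapse by chemical class, and the reason a class can still appear more than once, can be seen in a small Python sketch. The tuple layout, identifiers and scores are hypothetical, and the ranking score is assumed to have been computed beforehand.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical candidate record: (target, chemical_class, compound, ranking_score)
Candidate = Tuple[str, str, str, float]


def diverse_screening_set(candidates: Iterable[Candidate]) -> List[str]:
    """Take the highest-scoring compound per (target, chemical class) pair."""
    best: Dict[Tuple[str, str], Tuple[str, float]] = {}
    for target, chem_class, compound, score in candidates:
        key = (target, chem_class)
        if key not in best or score > best[key][1]:
            best[key] = (compound, score)
    # A compound selected for several targets is listed only once.
    return sorted({compound for compound, _ in best.values()})


toy_candidates = [
    ("target_1", "class_a", "cpd_1", 0.9),
    ("target_1", "class_a", "cpd_2", 0.7),
    ("target_2", "class_a", "cpd_3", 0.8),
    ("target_2", "class_b", "cpd_4", 0.5),
]
selected = diverse_screening_set(toy_candidates)
```

Here `cpd_1` and `cpd_3` both belong to `class_a` but are kept because they were chosen for different targets, which is the situation the note above describes.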
A brief literature search revealed that some of our top 20 targets have candidate compounds that have previously been shown to have anthelmintic activity: for example, a candidate for (Compara family 644442) is the anticancer drug fluorouracil, which has anthelmintic activity against Echinococcus larvae and cells442. The self-organising map (Fig. 6a) confirms the diversity of these 5046 compounds, as they are distributed across 390 of the 400 cells, while the known anthelmintics are much more clustered in the map, in only 127 cells (Fig. 6b). Randomly selected sets of 236 of the 5046 compounds are more widely distributed than this, filling a median of 174 cells (range 147-200 for 500,000 randomisations), so a similar distribution of the novel compounds and known anthelmintic compounds can be rejected at P < 2e-6. The average distance between cells occupied by randomly chosen pairs of screening compounds is also significantly greater than that for pairs of known anthelmintics, even when compounds sharing a single cell are excluded (medians 9 vs 8, means 9.25 vs 8.19; P < 2.2e-16 by Wilcoxon signed-rank test for 10,000 random pairs from each group), suggesting that the clustering of known anthelmintics is not limited to compounds sharing the same cell but is also reflected in the topology of the map.
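The randomisation test on map occupancy can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: each compound is represented only by the index of the map cell it falls in, the function name and defaults are hypothetical, and the p-value is reported as an upper bound of 1/n when no random draw is as extreme as the observed value (with 500,000 randomisations this bound is 2e-6).

```python
import random
from typing import List, Sequence, Tuple


def randomisation_test(
    screening_cells: Sequence[int],
    n_known: int,
    observed_known_cells: int,
    n_iter: int = 1000,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Compare cell occupancy of random screening subsets with an observed count.

    screening_cells: map cell index for each screening compound.
    n_known: size of each random subset (236 in the text).
    observed_known_cells: cells occupied by known anthelmintics (127 in the text).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_iter):
        subset = rng.sample(list(screening_cells), n_known)
        counts.append(len(set(subset)))  # number of distinct cells filled
    # Draws that are at least as clustered as the known anthelmintics.
    n_as_extreme = sum(c <= observed_known_cells for c in counts)
    # If no draw is as extreme, the p-value is bounded above by 1 / n_iter.
    p_upper = max(n_as_extreme, 1) / n_iter
    return counts, p_upper
```

A call such as `randomisation_test(cells, 236, 127, n_iter=500_000)` would mirror the numbers in the text, where `cells` is a hypothetical list holding one cell index per screening compound.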

5.6 A high priority target set, giving a smaller diverse screening set

Since 5046 compounds is still too many for smaller laboratories, we also created a second, smaller screening set by identifying 40 ‘high priority targets’ among our top 15% of targets (Supplementary Table 21d, column S). These consisted of several classes of interesting targets that we noticed among our top 15% of targets. Firstly, they included targets that lack human homologs (three targets, as defined using our ‘ChEMBL non human score’; see Methods), which suggests possible selective compounds, or that are clade-specific chokepoints, for example, predicted in all nematodes but no platyhelminths (two targets), and so again may have specificity. Secondly, we included targets that have greatly expanded in some helminths (five targets; from Supplementary Table 9a; see Gene Family Expansions above), and so may perform functions important to the parasite. Thirdly, we included known/likely anthelmintic targets (from Supplementary Table 21b) or targets in the same pathway as known anthelmintic targets (30 targets, Extended Data Fig. 8), whose compounds may also be anthelmintic but could possibly be more efficacious and/or cause less severe side effects. These 40 high priority targets have 720 candidate compounds, including 181 drugs (phase IV/III) in 173 chemical classes, and 539 medicinal chemistry compounds in 530 chemical classes. A brief literature search revealed encouraging evidence that some of these have anthelmintic activity. For example, we suggest several compounds for the target glycogen phosphorylase, which is in the same pathway as a known anthelmintic target (glycogen phosphorylase phosphatase, inhibited by niridazole), including the phase III drug alvocidib (flavopiridol), which has anthelmintic activity in C. elegans71. Another example is the target cathepsin B, which has greatly expanded in clade Va (strongylid-other; Supplementary Table 9a), for which we suggest several compounds including the phase III drug odanacatib, which has recently been shown to have anthelmintic activity against hookworms72.

Supplementary Figures

Supplementary Fig. 1. Patterns of gene family sharing. (a) The numbers of gene families (pink bars; values on left-hand y-axis) and the numbers of genes in those families (blue bars; values on right-hand y-axis) with particular patterns of sharing between high-level groups in our Compara data. Shading in the lower panel from pink to blue represents how widespread each set of families is, with pink representing families specific to one group and dark blue those families present in all groups. (b) Scatterplot of gene family size against the number of species a family is present in, with each point representing a single gene family (families with fewer than 3 genes are excluded), and points coloured according to the number of higher-level taxonomic groups they are shared between, as in the lower part of panel (a).

Supplementary Fig. 2. Phylogenetic tree of ferrochelatases. Branches are coloured by membership of protein sequences to ferrochelatase Compara families. Bootstrap support of the main branches is indicated. FeCL (nematode): nematode specific non-functional ferrochelatase-like proteins, devoid of active site (family 740872); CladeIVa FeCL (nematode): synapomorphic non-functional ferrochelatase-like family of nematode clade IVa (family 1184543); FeCH (Non-nematode): functional ferrochelatase family composed of taxa of non-nematode phyla (family 850580 and part of family 787620); Alphaproteobacteria: ferrochelatases of Alphaproteobacteria (Rhizobiales); Bacteroidetes: ferrochelatases of Leadbetterella byssophila and Mucilaginibacter paludis; Gammaproteobacteria: ferrochelatases of Pseudomonas species; Wolbachia-like: ferrochelatases of Wolbachia, Ehrlichia and Hydrogenobacter species; and HGT-FeCH (nematode): clade III/IV specific functional ferrochelatase acquired through horizontal gene transfer from Alphaproteobacteria (family 787620).

Supplementary Fig. 3. Expanded families involved in immunity and development. Expanded gene families with potential roles in immunity and development. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 4. Distribution and phylogeny of group I SCP/TAPS genes in strongylid nematodes. (a) The phylogeny of the group I clade expanded in strongylids (clades Va, Vb and Vc). Red dots show high bootstrap values (≥0.8). For visualisation purposes, the numbered nodes were collapsed in Fig. 3. Zooming into details of a branch reveals that there have been both species-specific expansions (e.g. node 2313) and more ancient expansions shared by several strongylid species (e.g. node 1618). (b) A heatmap showing the distribution of SCP/TAPS gene counts, where each row of the heatmap corresponds to a leaf of the tree in (a). Note that in some cases a leaf corresponds to a consensus sequence representing a group of genes, which belong to species from the same species group (Supplementary Information: Methods 23). (c) A heatmap showing the number of genes per species for each of the numbered nodes from (a).

Supplementary Fig. 5. Distribution of SCP/TAPS genes in each species. The number of length-filtered SCP/TAPS genes used as input for the phylogenetic analysis is shown (Supplementary Information: Methods 23). The vertical black lines show the median numbers of Group 1 and Group 2 genes per species, across all species.

Supplementary Fig. 6. Hypothetical proteins. (a) Histograms of log(family size) for the Compara gene families that lacked functional annotation (protein names; Supplementary Information: Results 2.4) and families with annotations (red and blue lines, respectively). Nearly half of the Compara gene families (46.9%) lacked functional annotation. These families tended to be smaller than those with annotations (P<2.2e-16, Kolmogorov-Smirnov test). (b) Size distribution of hypothetical families shows that many were found in a large number of species, suggesting that they contained genuine genes of unknown function.

Supplementary Fig. 7. Striking expansions in families with poorly defined roles. Gene families with striking variation across species but uncharacterised (a-i), or poorly characterised (j) functions. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 8. Protease family expansions. Expanded gene families of proteases and protease inhibitors. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 9. Heatmap of ligand gated ion channels. Relative abundance profiles for all 99 LGICs represented in at least 3 of the 81 helminth species (Supplementary Information: Methods 19). 3 LGICs present in fewer than 3 species were omitted from the visualisation.

Supplementary Fig. 10. Inferred phylogeny of the cys-loop superfamily in platyhelminths and nematodes. Posterior probabilities were calculated from 8 reversible jump MCMC chains using MrBayes. The tree is rooted between nicotinic acetylcholine receptors and non-nicotinic anion channels. Posterior probabilities are displayed for nodes that correspond to our classification of cys-loop proteins.

Supplementary Fig. 11. Heatmap of ABC transporters. Relative abundance profiles for 50 ABC transporter classes (Supplementary Information: Methods 19).

Supplementary Fig. 12. Comparison of pathway coverage and variation. (a) Taxonomic differences in KEGG reference pathway coverage. For nematode clades, the comparison is with the union of the other five nematode groups. Platyhelminths, Clade IIIc- and Clade IVa- show the most consistent overall pattern. Comparisons with Wilcoxon test P-value <0.05 (FDR corrected using the Benjamini-Hochberg procedure) are considered significant. (b) Within-group variation in KEGG pathway coverage, and differences between groups. Only helminth-relevant KEGG metabolic pathways were considered (Supplementary Information: Methods 25). The y-axis is the coefficient of variation of the KEGG pathway coverage. For nematode clades, the comparison is with the union of the other five nematode groups. (c) Variation among nematodes and platyhelminths in KEGG pathway coverage aggregated according to superpathways. In panels (b) and (c), statistically significant comparisons are shown (red = lower; green = higher): * 0.01

Supplementary Fig. 13. Metabolism-related family expansions. Expanded Compara families with metabolism-related functions. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 14. Maximum likelihood phylogenies of two putative horizontal transfers. Phylogeny of helminth proteins, closely related sequences identified by BLAST and selected other sequences for (a) a clade IIIb-specific cobalamin-related family (CobQ/CbiP) gained from Bacteria, and (b) an acetate/succinate transporter in clade I nematodes that appears to have been gained from Bacteria, and is likely to participate in acetate/succinate uptake or efflux.

Supplementary Fig. 15. Genome assembly pipelines. The individual pipelines used at (a) WTSI, (b) MGI and (c) BaNG. Details as defined in Methods.

Supplementary Fig. 16. Gene-finding pipelines. The individual pipelines used at (a) WTSI, (b) MGI and (c) BaNG. Details as defined in Methods.

Supplementary Fig. 17. Network representation of Compara families. Species are represented as nodes in the network and edges between two nodes are weighted by the number of times genes of both species appear together in a Compara family. Edges are coloured based on the colour of the phylum of the nodes they connect. Nodes are scaled based on relative proteome size for each species, coloured based on taxonomic membership and labelled according to the following list: 0 = Acanthocheilonema viteae, 1 = Amphimedon queenslandica, 2 = Ancylostoma caninum, 3 = , 4 = Ancylostoma duodenale, 5 = Angiostrongylus cantonensis, 6 = Angiostrongylus costaricensis, 7 = Anisakis simplex, 8 = Ascaris lumbricoides, 9 = Ascaris suum, 10 = Brugia malayi, 11 = Brugia pahangi, 12 = Brugia timori, 13 = Bursaphelenchus xylophilus, 14 = Caenorhabditis elegans, 15 = Capitella teleta, 16 = Ciona intestinalis, 17 = Clonorchis sinensis, 18 = Crassostrea gigas, 19 = Cylicostephanus goldi, 20 = Danio rerio, 21 = Dictyocaulus viviparus, 22 = Dibothriocephalus latus, 23 = , 24 = Dracunculus medinensis, 25 = Drosophila melanogaster, 26 = Echinococcus granulosus, 27 = Echinococcus multilocularis, 28 = Echinostoma caproni, 29 = Elaeophora elaphi, 30 = Enterobius vermicularis, 31 = Fasciola hepatica, 32 = Globodera pallida, 33 = Gongylonema pulchrum, 34 = Haemonchus contortus, 35 = Haemonchus placei, 36 = Heligmosomoides bakeri, 37 = Homo sapiens, 38 = Hydatigera taeniaeformis, 39 = Hymenolepis diminuta, 40 = Hymenolepis microstoma, 41 = , 42 = Ixodes scapularis, 43 = Litomosoides sigmodontis, 44 = , 45 = Meloidogyne hapla, 46 = Mesocestoides corti, 47 = Necator americanus, 48 = Nematostella vectensis, 49 = Nippostrongylus brasiliensis, 50 = Oesophagostomum dentatum, 51 = Onchocerca flexuosa, 52 = Onchocerca ochengi, 53 = Onchocerca volvulus, 54 = Panagrellus redivivus, 55 = , 56 = Parastrongyloides trichosuri, 57 = Pristionchus pacificus, 58 = Protopolystoma xenopodis, 59 = Rhabditophanes kr3021, 60 = 
Romanomermis culicivorax, 61 = Schistocephalus solidus, 62 = Schistosoma curassoni, 63 = Schistosoma haematobium, 64 = , 65 = Schistosoma mansoni, 66 = Schistosoma margrebowiei, 67 = Schistosoma mattheei, 68 = Schistosoma rodhaini, 69 = Schmidtea mediterranea, 70 = Soboliphyme baturini, 71 = Spirometra erinaceieuropaei, 72 = Strongyloides papillosus, 73 = Strongyloides ratti, 74 = Strongyloides stercoralis, 75 = Strongyloides venezuelensis, 76 = Strongylus vulgaris, 77 = Syphacia muris, 78 = , 79 = Taenia solium, 80 = Teladorsagia circumcincta, 81 = callipaeda, 82 = Toxocara canis, 83 = Trichinella nativa, 84 = Trichinella spiralis, 85 = , 86 = Trichoplax adhaerens, 87 = Trichuris muris, 88 = , 89 = , 90 = Wuchereria bancrofti.

Supplementary Fig. 18. Metabolic chokepoints. (a) Number of chokepoints before and after hole filling. (b) Clustering of the species based on presence and absence of the chokepoints. (c) Sharing between nematodes and platyhelminths, of chokepoints present in at least one species of the set (left), and chokepoints present in all species of the set (right).

Supplementary Fig. 19. Unique enzymes annotated and KEGG metabolic modules among the 81 nematode and platyhelminth species. (a) Counts of enzymes (unique EC identifiers) in the 81 platyhelminth and nematode species (i.e. present in any one of the set). (b) Distribution of conserved ECs across nematode and platyhelminth species (i.e. conserved across all species of the set). (c,d) Number of unique ECs annotated and KEGG modules deemed complete in all 81 species (in c) and the tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7) (in d), along with their proteome size. (e) Clustering of all 81 species based on the KEGG module presence (Jaccard similarity index and Ward’s linkage). Bootstrap support values are indicated.

Supplementary Fig. 20. The mitochondrial gene order and phylogeny for nematode species. Elaeophora elaphi, Globodera pallida, Meloidogyne hapla, and Soboliphyme baturini were excluded from the analysis because of insufficient mitochondrial genome data. Inverted sequences are shown by gene boxes with inverted text. The maximum-likelihood tree (left) was constructed using 12 mitochondrial proteins. Asterisks indicate that the assembly contains small gaps. The scale bar shows the number of amino acid substitutions per site.

Supplementary Fig. 21. The mitochondrial gene order and phylogeny for trematode species. Schistosoma rodhaini was excluded because of insufficient mitochondrial genome data. Inverted sequences are shown by gene boxes with inverted text. The maximum-likelihood tree (left) was constructed using 12 mitochondrial proteins. The scale bar shows the number of amino acid substitutions per site.

Supplementary Fig. 22. The mitochondrial gene order and phylogeny for cestode species. The platyhelminth Schmidtea mediterranea was used as an outgroup. Protopolystoma xenopodis was excluded because of insufficient mitochondrial genome data. Inverted sequences are shown by gene boxes with inverted text. The maximum-likelihood tree (left) was constructed using 12 mitochondrial proteins. The scale bar shows the number of amino acid substitutions per site.

Supplementary Fig. 23. Cazymes heatmap. Relative abundance profiles for 81 Cazyme families represented in at least 3 of the 81 helminth species. 25 families present in fewer than 3 species were omitted from the visualization. GH = Glycoside hydrolases, PL = Polysaccharide Lyases, GT = Glycosyltransferases, CBM = Carbohydrate-Binding Modules.

Supplementary Fig. 24. Example of inexplicable expansion. Expansion of the Compara XPG gene family. The plot shows the gene count in each species, superimposed upon our species tree. A scale bar beside the plot shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 25. Domain composition of SCP/TAPS genes. A domain combination is included in this matrix if the combination can be detected in more than five sequences amongst all species. The column labelled ‘excluded’ gives the numbers of genes discarded from the analysis, based on their length (Supplementary Information: Methods 23). The ‘included’ column gives the remaining numbers of length-filtered SCP/TAPS genes in each species.

Supplementary Fig. 26. Kinases heatmap. Relative abundance profiles for all kinase genes, and for 343 individual kinase Compara gene families represented in at least 5 of the 81 helminth species (Supplementary Information: Methods 19). 689 Compara families present in fewer than 5 species and an additional 43 unclassified kinase families were omitted from the visualisation. ‘Total Directly Annotated Kinases’ represents the relative abundance of kinases that were directly annotated per species, while ‘Per Kinase Compara Family’ represents the relative abundance of genes per annotated kinase family.

Supplementary Tables

Supplementary Table 1: Properties of assemblies
Supplementary Table 2: Gene set properties
Supplementary Table 3: Rationale, classification and taxonomy of species used
Supplementary Table 4: Origin of data used in comparison of gene sets in Compara
Supplementary Table 5: Repeat content
Supplementary Table 6: Results of Bayesian mixed effect model for genome size
Supplementary Table 7: GO annotation, predicted secretion data and transmembrane domains, and per-species gene counts for Compara gene families
Supplementary Table 8a: Counts of synapomorphic gene families at key nodes of the species tree
Supplementary Table 8b: Functional annotation of synapomorphic families
Supplementary Table 9a: 995 gene families selected based on variation scores
Supplementary Table 9b: Gene families excluded from expanded family analysis
Supplementary Table 10a: Counts of SCP/TAPS genes identified in the genomes
Supplementary Table 10b: Identifiers of SCP/TAPS genes identified in the genomes
Supplementary Table 11a: MEROPS protease family counts per species
Supplementary Table 11b: MEROPS protease inhibitor family counts per species
Supplementary Table 11c: Protease and protease inhibitor genes
Supplementary Table 12: Species life history trait
Supplementary Table 13: Parasite tissue tropism in definitive host
Supplementary Table 14a: Nematode-specific domain combinations
Supplementary Table 14b: Platyhelminth-specific domain combinations
Supplementary Table 15a: GPCR class, confidence level, and number of members in the final 230 Compara GPCR families
Supplementary Table 15b: Species-wise distribution of GPCRs in the final 230 Compara GPCR families
Supplementary Table 15c: Original putative GPCRs
Supplementary Table 15d: Final list of GPCRs
Supplementary Table 15e: Genes used as initial seeds to find GPCR Compara families
Supplementary Table 16a: Ligand gated ion channel predictions in our nematode and platyhelminth gene sets
Supplementary Table 16b: Accessions for ligand gated ion channels (from the WormBase Parasite database) used in the phylogenetic analysis
Supplementary Table 17: ABC transporter predictions
Supplementary Table 18a: Enzyme Commission number (EC) annotations for all 81 nematode and platyhelminth species
Supplementary Table 18b: Low confidence EC annotations based on membership of Compara families
Supplementary Table 18c: KEGG metabolic module completion for all 81 nematode and platyhelminth species, without using ECs from hole-filling
Supplementary Table 18d: KEGG metabolic module completion for the 33 tier 1 nematode and platyhelminth species, including ECs from hole-filling
Supplementary Table 18e: Heatmap of pathway conservation in KEGG pathways for nematodes and platyhelminths, without using ECs from hole-filling
Supplementary Table 18f: Heatmap of pathway conservation in KEGG pathways for nematodes and platyhelminths, using ECs from hole-filling
Supplementary Table 18g: Amino acid and vitamin auxotrophy predictions for tier 1 species, assigned by Pathway tools
Supplementary Table 18h: Metabolic chokepoints detected in helminths
Supplementary Table 19a: Beta-oxidation enzyme annotation
Supplementary Table 19b: Ketone bodies metabolism enzyme annotation
Supplementary Table 19c: Glyoxylate cycle specific enzyme annotation
Supplementary Table 19d: GABA-shunt enzyme annotation
Supplementary Table 19e: Glycine cleavage system enzyme annotation
Supplementary Table 19f: Pyrimidine synthesis enzyme annotation
Supplementary Table 19g: IMP biosynthesis enzyme annotation
Supplementary Table 19h: Haem synthesis enzyme annotation
Supplementary Table 20: KEGG pathways with most unique ECs (present in flatworms but absent in nematodes, or vice versa)
Supplementary Table 21a: Known anthelmintic drugs (for humans and veterinary animals), nematocides, and other anthelmintic compounds
Supplementary Table 21b: Known (or hypothesised) drug single-protein targets for known anthelmintic drugs and compounds
Supplementary Table 21c: Novel anthelmintic drug targets proposed in literature
Supplementary Table 21d: Potential anthelmintic drug targets
Supplementary Table 22: Sample and sequencing information
Supplementary Table 23: RNAseq used for gene-finding QC
Supplementary Table 24a: Nematode and platyhelminth mitochondrial genomes
Supplementary Table 24b: Codon usage in nematode mitochondrial genomes
Supplementary Table 24c: Codon usage in cestode mitochondrial genomes, and Schmidtea
Supplementary Table 24d: Codon usage in trematode mitochondrial genomes
Supplementary Table 25a: Kinase counts per kinase class and per Compara family
Supplementary Table 25b: Kinase gene identifiers in nematode and platyhelminth species
Supplementary Table 25c: Relative kinase total abundances and per-family relative abundances for kinase families
Supplementary Table 26a: InterPro (IPR) enrichment in different species groups
Supplementary Table 26b: GO enrichment in different species groups
Supplementary Table 26c: Pfam enrichment in different species groups
Supplementary Table 26d: InterPro (IPR), Pfam and GO enrichment in individual species
Supplementary Table 26e: Methods for GO, InterPro (IPR) and Pfam enrichment analysis
Supplementary Table 27: CAZYME predictions for the 81 nematodes and platyhelminths


114 280 Fisk Green, R., Lorson, M., Walhout, A. J., Vidal, M. & van den Heuvel, S. Identification of critical domains and putative partners for the Caenorhabditis elegans spindle component LIN-5. Mol Genet Genomics 271, 532-544, doi:10.1007/s00438-004-1012-x (2004). 281 Kikuchi, T. et al. Genomic insights into the origin of parasitism in the emerging plant Bursaphelenchus xylophilus. PLoS Pathog 7, e1002219, doi:10.1371/journal.ppat.1002219 (2011). 282 Perry, R. N. & Wharton, D. A. Molecular and Physiological Basis of Nematode Survival., (CABI, 2011). 283 Sonobe, H. & Ito, Y. Phosphoconjugation and dephosphorylation reactions of steroid hormone in insects. Mol Cell Endocrinol 307, 25-35, doi:10.1016/j.mce.2009.03.017 (2009). 284 Sonobe, H. et al. Purification, kinetic characterization, and molecular cloning of a novel enzyme, ecdysteroid 22-kinase. J Biol Chem 281, 29513-29524, doi:10.1074/jbc.M604035200 (2006). 285 Sluder, A. E. & Maina, C. V. Nuclear receptors in nematodes: themes and variations. Trends Genet 17, 206-213 (2001). 286 Tzertzinis, G. et al. Molecular evidence for a functional ecdysone signaling system in Brugia malayi. PLoS neglected tropical diseases 4, e625, doi:10.1371/journal.pntd.0000625 (2010). 287 Parihar, M. et al. The genome of the nematode Pristionchus pacificus encodes putative homologs of RXR/Usp and EcR. Gen Comp Endocrinol 167, 11-17, doi:10.1016/j.ygcen.2010.02.005 (2010). 288 Shea, C., Richer, J., Tzertzinis, G. & Maina, C. V. An EcR homolog from the filarial parasite, Dirofilaria immitis requires a ligand-activated partner for transactivation. Molecular and biochemical parasitology 171, 55-63, doi:10.1016/j.molbiopara.2010.02.002 (2010). 289 Graham, L. D., Kotze, A. C., Fernley, R. T. & Hill, R. J. An ortholog of the ecdysone receptor protein (EcR) from the parasitic nematode Haemonchus contortus. Molecular and biochemical parasitology 171, 104-107, doi:10.1016/j.molbiopara.2010.03.003 (2010). 290 Zhao, L. et al. 
Chemical signals synchronize the life cycles of a plant-parasitic nematode and its vector . Current biology : CB 23, 2038-2043, doi:10.1016/j.cub.2013.08.041 (2013). 291 O'Donovan, A., Davies, A. A., Moggs, J. G., West, S. C. & Wood, R. D. XPG endonuclease makes the 3' incision in human DNA nucleotide excision repair. Nature 371, 432-435, doi:10.1038/371432a0 (1994). 292 Habraken, Y., Sung, P., Prakash, L. & Prakash, S. Yeast excision repair gene RAD2 encodes a single-stranded DNA endonuclease. Nature 366, 365-368, doi:10.1038/366365a0 (1993). 293 Eyboulet, F. et al. Mediator links transcription and DNA repair by facilitating Rad2/XPG recruitment. Genes Dev 27, 2549-2562, doi:10.1101/gad.225813.113 (2013). 294 Ruelas, D. S., Karentz, D. & Sullivan, J. T. Sublethal effects of ultraviolet b radiation on miracidia and sporocysts of Schistosoma mansoni: intramolluscan development, infectivity, and photoreactivation. J Parasitol 93, 1303-1310, doi:10.1645/GE-1227.1 (2007). 295 Silva, C. S. et al. Schistosoma mansoni: gene expression of the nucleotide excision repair factor 2 (NEF2) during the parasite life cycle, and in adult worms after exposure to different DNA-damaging agents. Acta Trop 104, 52-62, doi:10.1016/j.actatropica.2007.07.006 (2007). 296 Studer, A., Lamare, M. D. & Poulin, R. Effects of ultraviolet radiation on the transmission process of an intertidal trematode parasite. Parasitology 139, 537-546, doi:10.1017/S0031182011002174 (2012). 297 Robellet, X., Flipphi, M., Pegot, S., Maccabe, A. P. & Velot, C. AcpA, a member of the GPR1/FUN34/YaaH membrane protein family, is essential for acetate permease activity in

115 the hyphal fungus Aspergillus nidulans. Biochem J 412, 485-493, doi:10.1042/BJ20080124 (2008). 298 Sa-Pessoa, J. et al. SATP (YaaH), a succinate-acetate transporter protein in Escherichia coli. Biochem J 454, 585-595, doi:10.1042/BJ20130412 (2013). 299 Goodenough, U. et al. The path to triacylglyceride obesity in the sta6 strain of Chlamydomonas reinhardtii. Eukaryot Cell 13, 591-613, doi:10.1128/EC.00013-14 (2014). 300 Kmetec, E. & Bueding, E. Production of succinate by the canine whipworm Trichuris vulpis. Comp Biochem Physiol 15, 271-274 (1965). 301 Palkova, Z. et al. Ammonia pulses and metabolic oscillations guide yeast colony development. Mol Biol Cell 13, 3901-3914, doi:10.1091/mbc.E01-12-0149 (2002). 302 Bader, G. et al. Crystal structure of rat GTP cyclohydrolase I feedback regulatory protein, GFRP. J Mol Biol 312, 1051-1057, doi:10.1006/jmbi.2001.5011 (2001). 303 Loer, C. M. et al. Cuticle integrity and biogenic amine synthesis in Caenorhabditis elegans require the cofactor tetrahydrobiopterin (BH4). Genetics 200, 237-253, doi:10.1534/genetics.114.174110 (2015). 304 Funderburk, C. D., Bowling, K. M., Xu, D., Huang, Z. & O'Donnell, J. M. A typical N-terminal extensions confer novel regulatory properties on GTP cyclohydrolase isoforms in Drosophila melanogaster. J Biol Chem 281, 33302-33312, doi:10.1074/jbc.M602196200 (2006). 305 Blaxter, M. & Koutsovoulos, G. The evolution of parasitism in Nematoda. Parasitology 142 Suppl 1, S26-39, doi:10.1017/S0031182014000791 (2015). 306 van Megen, H. et al. A phylogenetic tree of nematodes based on about 1200 full-length small subunit ribosomal DNA sequences. Nematology 11, 927-950, doi:10.1163/156854109X456862 (2009). 307 De Ley, P. & Blaxter, M. in The Biology of Nematodes. (ed D. L. Lee) Ch. 1, 1-30 (CRC Press, 2002). 308 DeCosky, A. A journal interview with Dr. Ronald L. Occhionero. Ohio Dent J 51, 28-33 (1977). 309 Lockyer, A. E., Olson, P. D. & Littlewood, D. T. J. 
Utility of complete large and small subunit rRNA genes in resolving the phylogeny of the Neodermata (Platyhelminthes): implications and a review of the cercomer theory. Biol J Linn Soc 78, 155-171, doi:DOI 10.1046/j.1095- 8312.2003.00141.x (2003). 310 Hahn, C., Fromm, B. & Bachmann, L. Comparative genomics of flatworms (platyhelminthes) reveals shared genomic features of ecto- and endoparastic neodermata. Genome Biol Evol 6, 1105-1117, doi:10.1093/gbe/evu078 (2014). 311 Justine, J. L. Non-monophyly of the monogeneans? International journal for parasitology 28, 1653-1657 (1998). 312 Olson, P. D. & Littlewood, D. T. Phylogenetics of the Monogenea--evidence from a medley of molecules. International journal for parasitology 32, 233-244 (2002). 313 Chalmers, I. W. & Hoffmann, K. F. Platyhelminth Venom Allergen-Like (VAL) proteins: revealing structural diversity, class-specific features and biological associations across the phylum. Parasitology 139, 1231-1245, doi:10.1017/S0031182012000704 (2012). 314 Hawdon, J. M., Jones, B. F., Hoffman, D. R. & Hotez, P. J. Cloning and characterization of Ancylostoma-secreted protein. A novel protein associated with the transition to parasitism by infective hookworm larvae. J Biol Chem 271, 6672-6678 (1996). 315 Hawdon, J. M., Narasimhan, S. & Hotez, P. J. Ancylostoma secreted protein 2: cloning and characterization of a second member of a family of nematode secreted proteins from Ancylostoma caninum. Molecular and biochemical parasitology 99, 149-165 (1999). 316 Cantacessi, C. et al. Insights into SCP/TAPS proteins of liver flukes based on large-scale bioinformatic analyses of sequence datasets. PloS one 7, e31164, doi:10.1371/journal.pone.0031164 (2012).

116 317 Asojo, O. A. et al. Crystallization and preliminary X-ray analysis of Na-ASP-1, a multi- domain pathogenesis-related-1 protein from the human hookworm parasite Necator americanus. Acta Crystallogr Sect F Struct Biol Cryst Commun 61, 391-394, doi:10.1107/S1744309105007748 (2005). 318 Curwen, R. S., Ashton, P. D., Sundaralingam, S. & Wilson, R. A. Identification of novel proteases and immunomodulators in the secretions of schistosome cercariae that facilitate host entry. Mol Cell Proteomics 5, 835-844, doi:10.1074/mcp.M500313-MCP200 (2006). 319 Bower, M. A., Constant, S. L. & Mendez, S. Necator americanus: the Na-ASP-2 protein secreted by the infective larvae induces neutrophil recruitment in vivo and in vitro. Exp Parasitol 118, 569-575, doi:10.1016/j.exppara.2007.11.014 (2008). 320 Bethony, J. et al. Antibodies against a secreted protein from hookworm larvae reduce the intensity of in humans and vaccinated laboratory animals. FASEB J 19, 1743-1745, doi:10.1096/fj.05-3936fje (2005). 321 Loukas, A., Bethony, J., Brooker, S. & Hotez, P. Hookworm vaccines: past, present, and future. Lancet Infect Dis 6, 733-741, doi:10.1016/S1473-3099(06)70630-2 (2006). 322 Mendez, S., A, D. S., Antoine, A. D., Ahn, S. & Hotez, P. Use of the air pouch model to investigate immune responses to a containing the Na-ASP-2 protein in rats. Parasite Immunol 30, 53-56, doi:10.1111/j.1365-3024.2007.00994.x (2008). 323 Xiao, S. et al. The evaluation of recombinant hookworm antigens as vaccines in hamsters (Mesocricetus auratus) challenged with human hookworm, Necator americanus. Exp Parasitol 118, 32-40, doi:10.1016/j.exppara.2007.05.010 (2008). 324 Tang, Y. T. et al. Genome of the human hookworm Necator americanus. Nat Genet 46, 261-269, doi:10.1038/ng.2875 (2014). 325 Schwarz, E. M. et al. The genome and transcriptome of the zoonotic hookworm Ancylostoma ceylanicum identify infection-specific gene families. Nat Genet 47, 416-422, doi:10.1038/ng.3237 (2015). 326 Hewitson, J. 
P., Grainger, J. R. & Maizels, R. M. Helminth immunoregulation: the role of parasite secreted proteins in modulating host immunity. Molecular and biochemical parasitology 167, 1-11, doi:10.1016/j.molbiopara.2009.04.008 (2009). 327 Hotez, P. J. & Cerami, A. Secretion of a proteolytic anticoagulant by Ancylostoma hookworms. J Exp Med 157, 1594-1603 (1983). 328 Williamson, A. L., Brindley, P. J., Knox, D. P., Hotez, P. J. & Loukas, A. Digestive proteases of blood-feeding nematodes. Trends Parasitol 19, 417-423 (2003). 329 Soblik, H. et al. Life cycle stage-resolved proteomic analysis of the excretome/secretome from Strongyloides ratti--identification of stage-specific proteases. Mol Cell Proteomics 10, M111 010157, doi:10.1074/mcp.M111.010157 (2011). 330 Feng, J. et al. Molecular cloning and characterization of Ac-MTP-2, an astacin-like metalloprotease released by adult Ancylostoma caninum. Molecular and biochemical parasitology 152, 132-138, doi:10.1016/j.molbiopara.2007.01.001 (2007). 331 Gomez Gallego, S. et al. Identification of an astacin-like metallo-proteinase transcript from the infective larvae of Strongyloides stercoralis. Parasitol Int 54, 123-133, doi:10.1016/j.parint.2005.02.002 (2005). 332 Dvorak, J. et al. Differential use of protease families for invasion by schistosome cercariae. Biochimie 90, 345-358, doi:10.1016/j.biochi.2007.08.013 (2008). 333 Robinson, M. W., Dalton, J. P. & Donnelly, S. Helminth pathogen cathepsin proteases: it's a family affair. Trends Biochem Sci 33, 601-608, doi:10.1016/j.tibs.2008.09.001 (2008). 334 Wilson, L. R. et al. Fasciola hepatica: characterization and cloning of the major cathepsin B protease secreted by newly excysted juvenile liver fluke. Exp Parasitol 88, 85-94, doi:10.1006/expr.1998.4234 (1998).

117 335 Pratt, D., Cox, G. N., Milhausen, M. J. & Boisvenue, R. J. A developmentally regulated cysteine protease gene family in Haemonchus contortus. Molecular and biochemical parasitology 43, 181-191 (1990). 336 Pratt, D. et al. Cloning and sequence comparisons of four distinct cysteine proteases expressed by Haemonchus contortus adult worms. Molecular and biochemical parasitology 51, 209-218 (1992). 337 Klinkert, M. Q., Felleisen, R., Link, G., Ruppel, A. & Beck, E. Primary structures of Sm31/32 diagnostic proteins of Schistosoma mansoni and their identification as proteases. Molecular and biochemical parasitology 33, 113-122 (1989). 338 Rehman, A. & Jasmer, D. P. A tissue specific approach for analysis of membrane and secreted protein antigens from Haemonchus contortus gut and its application to diverse nematode species. Molecular and biochemical parasitology 97, 55-68 (1998). 339 Skuce, P. J. et al. Molecular cloning and characterization of gut-derived cysteine proteinases associated with a host protective extract from Haemonchus contortus. Parasitology 119 ( Pt 4), 405-412 (1999). 340 Harrop, S. A., Sawangjaroen, N., Prociv, P. & Brindley, P. J. Characterization and localization of cathepsin B proteinases expressed by adult Ancylostoma caninum hookworms. Molecular and biochemical parasitology 71, 163-171 (1995). 341 Cantacessi, C. et al. Differences in transcription between free-living and CO2-activated third-stage larvae of Haemonchus contortus. BMC Genomics 11, 266, doi:10.1186/1471- 2164-11-266 (2010). 342 Tyagi, R. et al. Cracking the nodule worm code advances knowledge of parasite biology and biotechnology to tackle major diseases of livestock. Biotechnol Adv 33, 980-991, doi:10.1016/j.biotechadv.2015.05.004 (2015). 343 Krautz-Peterson, G. & Skelly, P. J. Schistosome asparaginyl endopeptidase (legumain) is not essential for cathepsin B1 activation in vivo. Molecular and biochemical parasitology 159, 54-58, doi:10.1016/j.molbiopara.2007.12.011 (2008). 
344 Beckham, S. A. et al. A major cathepsin B protease from the liver fluke Fasciola hepatica has atypical active site features and a potential role in the digestive tract of newly excysted juvenile parasites. Int J Biochem Cell Biol 41, 1601-1612, doi:10.1016/j.biocel.2009.02.003 (2009). 345 Cancela, M. et al. Survey of transcripts expressed by the invasive juvenile stage of the liver fluke Fasciola hepatica. BMC Genomics 11, 227, doi:10.1186/1471-2164-11-227 (2010). 346 McGonigle, L. et al. The silencing of cysteine proteases in Fasciola hepatica newly excysted juveniles using RNA interference reduces gut penetration. International journal for parasitology 38, 149-155, doi:10.1016/j.ijpara.2007.10.007 (2008). 347 Liu, R. D. et al. Screening and characterization of early diagnostic antigens in excretory- secretory proteins from Trichinella spiralis intestinal infective larvae by immunoproteomics. Parasitol Res 115, 615-622, doi:10.1007/s00436-015-4779-2 (2016). 348 Trap, C. et al. Cloning and analysis of a cDNA encoding a putative serine protease comprising two trypsin-like domains of Trichinella spiralis. Parasitol Res 98, 288-294, doi:10.1007/s00436-005-0075-x (2006). 349 Young, N. D. et al. The genome provides insights into life in the bile duct. Nat Commun 5, 4378, doi:10.1038/ncomms5378 (2014). 350 de Boer, J. P. et al. Alpha-2-macroglobulin functions as an inhibitor of fibrinolytic, clotting, and neutrophilic proteinases in sepsis: studies using a baboon model. Infect Immun 61, 5035-5043 (1993). 351 Sripa, B. et al. The tumorigenic liver fluke Opisthorchis viverrini--multiple pathways to cancer. Trends Parasitol 28, 395-407, doi:10.1016/j.pt.2012.07.006 (2012).

118 352 Chen, N. et al. Identification of a nematode chemosensory gene family. Proceedings of the National Academy of Sciences of the United States of America 102, 146-151, doi:10.1073/pnas.0408307102 (2005). 353 Thomas, J. H., Kelley, J. L., Robertson, H. M., Ly, K. & Swanson, W. J. Adaptive evolution in the SRZ chemoreceptor families of Caenorhabditis elegans and Caenorhabditis briggsae. Proceedings of the National Academy of Sciences of the United States of America 102, 4476-4481, doi:10.1073/pnas.0406469102 (2005). 354 Park, D. et al. Interaction of structure-specific and promiscuous G-protein-coupled receptors mediates small-molecule signaling in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America 109, 9917-9922, doi:10.1073/pnas.1202216109 (2012). 355 Cully, D. F. & Paress, P. S. Solubilization and characterization of a high affinity ivermectin binding site from Caenorhabditis elegans. Mol Pharmacol 40, 326-332 (1991). 356 Jones, A. K. & Sattelle, D. B. The cys-loop ligand-gated ion channel gene superfamily of the nematode, Caenorhabditis elegans. Invert Neurosci 8, 41-47, doi:10.1007/s10158-008- 0068-4 (2008). 357 Dufour, V., Beech, R. N., Wever, C., Dent, J. A. & Geary, T. G. Molecular cloning and characterization of novel glutamate-gated chloride channel subunits from Schistosoma mansoni. PLoS Pathog 9, e1003586, doi:10.1371/journal.ppat.1003586 (2013). 358 Beg, A. A. & Jorgensen, E. M. EXP-1 is an excitatory GABA-gated cation channel. Nat Neurosci 6, 1145-1152, doi:10.1038/nn1136 (2003). 359 Dent, J. A. Evidence for a diverse Cys-loop ligand-gated ion channel superfamily in early bilateria. J Mol Evol 62, 523-535, doi:10.1007/s00239-005-0018-2 (2006). 360 Putrenko, I., Zakikhani, M. & Dent, J. A. A family of acetylcholine-gated chloride channel subunits in Caenorhabditis elegans. J Biol Chem 280, 6392-6398, doi:10.1074/jbc.M412644200 (2005). 361 Ribeiro, P., Gupta, V. & El-Sakkary, N. 
Biogenic amines and the control of neuromuscular signaling in schistosomes. Invert Neurosci 12, 13-28, doi:10.1007/s10158-012-0132-y (2012). 362 Theodoulou, F. L. & Kerr, I. D. ABC transporter research: going strong 40 years on. Biochem Soc Trans 43, 1033-1040, doi:10.1042/BST20150139 (2015). 363 Sheps, J. A., Ralph, S., Zhao, Z., Baillie, D. L. & Ling, V. The ABC transporter gene family of Caenorhabditis elegans has implications for the evolutionary dynamics of multidrug resistance in eukaryotes. Genome biology 5, R15, doi:10.1186/gb-2004-5-3-r15 (2004). 364 Schumacher, T. & Benndorf, R. A. ABC Transport Proteins in Cardiovascular Disease-A Brief Summary. Molecules 22, doi:10.3390/molecules22040589 (2017). 365 Lincke, C. R., Broeks, A., The, I., Plasterk, R. H. & Borst, P. The expression of two P- glycoprotein (pgp) genes in transgenic Caenorhabditis elegans is confined to intestinal cells. EMBO J 12, 1615-1620 (1993). 366 Kurz, C. L., Shapira, M., Chen, K., Baillie, D. L. & Tan, M. W. Caenorhabditis elegans pgp-5 is involved in resistance to bacterial infection and heavy metal and its regulation requires TIR-1 and a p38 map kinase cascade. Biochem Biophys Res Commun 363, 438-443, doi:10.1016/j.bbrc.2007.08.190 (2007). 367 Issouf, M. et al. Haemonchus contortus P-glycoproteins interact with host granules: a novel insight into the role of ABC transporters in host-parasite interaction. PloS one 9, e87802, doi:10.1371/journal.pone.0087802 (2014). 368 Kerboeuf, D., Blackhall, W., Kaminsky, R. & von Samson-Himmelstjerna, G. P-glycoprotein in helminths: function and perspectives for anthelmintic treatment and reversal of resistance. Int J Antimicrob Agents 22, 332-346 (2003).

119 369 Bosch, I. B., Wang, Z. X., Tao, L. F. & Shoemaker, C. B. Two Schistosoma mansoni cDNAs encoding ATP-binding cassette (ABC) family proteins. Molecular and biochemical parasitology 65, 351-356 (1994). 370 Mapes, J. et al. CED-1, CED-7, and TTR-52 regulate surface phosphatidylserine expression on apoptotic and phagocytic cells. Current biology : CB 22, 1267-1275, doi:10.1016/j.cub.2012.05.052 (2012). 371 Furlong, S. T. & Caulfield, J. P. Schistosoma mansoni: sterol and phospholipid composition of cercariae, schistosomula, and adults. Exp Parasitol 65, 222-231 (1988). 372 Ghedin, E. et al. Draft genome of the filarial nematode parasite Brugia malayi. Science 317, 1756-1760, doi:10.1126/science.1145406 (2007). 373 Bei, Y. et al. SRC-1 and Wnt signaling act together to specify endoderm and to control cleavage orientation in early C. elegans embryos. Dev Cell 3, 113-125 (2002). 374 Chen, D. et al. Germline signaling mediates the synergistically prolonged longevity produced by double mutations in daf-2 and rsks-1 in C. elegans. Cell Rep 5, 1600-1610, doi:10.1016/j.celrep.2013.11.018 (2013). 375 Hashiguchi, M. & Hashiguchi, T. Kinase-kinase interaction and modulation of tau phosphorylation. Int Rev Cell Mol Biol 300, 121-160, doi:10.1016/B978-0-12-405210- 9.00004-7 (2013). 376 Liao, J. C., Yang, T. T., Weng, R. R., Kuo, C. T. & Chang, C. W. TTBK2: a tau protein kinase beyond tau phosphorylation. Biomed Res Int 2015, 575170, doi:10.1155/2015/575170 (2015). 377 LaMunyon, C. W. et al. A New Player in the Spermiogenesis Pathway of Caenorhabditis elegans. Genetics 201, 1103-1116, doi:10.1534/genetics.115.181172 (2015). 378 Muhlrad, P. J., Clark, J. N., Nasri, U., Sullivan, N. G. & LaMunyon, C. W. SPE-8, a protein- tyrosine kinase, localizes to the spermatid cell membrane through interaction with other members of the SPE-8 group spermatid activation signaling pathway in C. elegans. BMC Genet 15, 83, doi:10.1186/1471-2156-15-83 (2014). 379 Mila, D. et al. 
Asymmetric Wnt Pathway Signaling Facilitates Stem Cell-Like Divisions via the Nonreceptor Tyrosine Kinase FRK-1 in Caenorhabditis elegans. Genetics 201, 1047- 1060, doi:10.1534/genetics.115.181412 (2015). 380 Kim, T. H., Kim, Y. J., Cho, J. W. & Shim, J. A novel zinc-carboxypeptidase SURO-1 regulates cuticle formation and body morphogenesis in Caenorhabditis elegans. FEBS Lett 585, 121-127, doi:10.1016/j.febslet.2010.11.020 (2011). 381 Brennan, R. G. & Matthews, B. W. The helix-turn-helix DNA binding motif. J Biol Chem 264, 1903-1906 (1989). 382 Protasio, A. V. et al. A systematically improved high quality genome and transcriptome of the human blood fluke Schistosoma mansoni. PLoS neglected tropical diseases 6, e1455, doi:10.1371/journal.pntd.0001455 (2012). 383 Korting, W. & Fairbairn, D. Changes in beta-oxidation and related enzymes during the life cycle of Strongyloides ratti (Nematoda). J Parasitol 57, 1153-1158 (1971). 384 Barrett, J. & Korting, W. Lipid Catabolism in Plerocercoids of Schistocephalus-Solidus (Cestoda-Pseudophyllidea). International journal for parasitology 7, 419-422, doi:Doi 10.1016/0020-7519(77)90068-6 (1977). 385 Fraga, C. M. et al. Alternative energy production pathways in Taenia crassiceps cysticerci in vitro exposed to a derivative (RCB20). Parasitology 143, 488-493, doi:10.1017/S0031182015001729 (2016). 386 Skelly, P. J., Da'dara, A. A., Li, X. H., Castro-Borges, W. & Wilson, R. A. Schistosome feeding and regurgitation. PLoS Pathog 10, e1004246, doi:10.1371/journal.ppat.1004246 (2014). 387 Rodriguez-Contreras, D., Skelly, P. J., Landa, A., Shoemaker, C. B. & Laclette, J. P. Molecular and functional characterization and tissue localization of 2 glucose transporter

120 homologues (TGTP1 and TGTP2) from the tapeworm Taenia solium. Parasitology 117 ( Pt 6), 579-588 (1998). 388 Barrett, J., Ward, C. W. & Fairbairn, D. Glyoxylate Cycle and Conversion of Triglycerides to Carbohydrates in Developing Eggs of Ascaris-Lumbricoides. Comparative Biochemistry and Physiology 35, 577-+, doi:Doi 10.1016/0010-406x(70)90974-6 (1970). 389 Komuniecki, R. & Harris, B. G. in Biochemistry and Molecular Biology of Parasites. (eds J. J. Marr & M. Muller) Ch. 4, 49-66 (Academic Press, 1995). 390 Madin, K. A. C., Loomis, S. H. & Crowe, J. H. Anhydrobiosis in Nematodes - Control of Carbon Flow through the Glyoxylate Cycle. J Exp Zool 234, 341-350, doi:DOI 10.1002/jez.1402340303 (1985). 391 Barrett, J. Forty years of helminth biochemistry. Parasitology 136, 1633-1642, doi:10.1017/S003118200900568X (2009). 392 Prichard, R. K. & SChofield, P. J. The glyoxylate cycle, fructose-1,6-diphosphatase and glyconeogenesis in Fasciola hepatica. Comparative Biochemistry and Physiology 29, 581- 590, doi:10.1016/0010-406X(69)91609-0 (1968). 393 Rafi, M. M. & Raj, R. K. Phosphoenolpyruvate-Succinate-Glyoxylate Pathway in the Filarial Parasite Setaria-Digitata. J Bioscience 16, 121-126, doi:Doi 10.1007/Bf02703364 (1991). 394 Voet, D. & Voet, J. G. Biochemistry., (John Wiley & Sons, 2011). 395 Okubo, Y., Yang, S., Chistoserdova, L. & Lidstrom, M. E. Alternative route for glyoxylate consumption during growth on two-carbon compounds by Methylobacterium extorquens AM1. J Bacteriol 192, 1813-1823, doi:10.1128/JB.01166-09 (2010). 396 Rubin, H. & Trelease, R. N. Subcellular localization of glyoxylate cycle enzymes in Ascaris suum larvae. J Cell Biol 70, 374-383 (1976). 397 Danpure, C. J. et al. Subcellular distribution of hepatic alanine:glyoxylate aminotransferase in various mammalian species. J Cell Sci 97 ( Pt 4), 669-678 (1990). 398 Wang, T. et al. 
Proteomic analysis of the excretory-secretory products from larval stages of Ascaris suum reveals high abundance of glycosyl hydrolases. PLoS neglected tropical diseases 7, e2467, doi:10.1371/journal.pntd.0002467 (2013). 399 Braeckman, B. P., Houthoofd, K. & Vanfleteren, J. R. Intermediary metabolism. WormBook, 1-24, doi:10.1895/wormbook.1.146.1 (2009). 400 Muller, M. et al. Biochemistry and evolution of anaerobic energy metabolism in eukaryotes. Microbiol Mol Biol Rev 76, 444-495, doi:10.1128/MMBR.05024-11 (2012). 401 Maule, A. G. & Marks, N. J. (CABI, 2005). 402 Kohler, P. The strategies of energy conservation in helminths. Molecular and biochemical parasitology 17, 1-18 (1985). 403 Inaoka, D. K. et al. Structural Insights into the Molecular Design of Flutolanil Derivatives Targeted for Fumarate Respiration of Parasite Mitochondria. Int J Mol Sci 16, 15287- 15308, doi:10.3390/ijms160715287 (2015). 404 Kim, J., Hannibal, L., Gherasim, C., Jacobsen, D. W. & Banerjee, R. A human vitamin B12 trafficking protein uses glutathione transferase activity for processing alkylcobalamins. J Biol Chem 284, 33418-33424, doi:10.1074/jbc.M109.057877 (2009). 405 Froese, D. S. et al. Structural Insights into the MMACHC-MMADHC Protein Complex Involved in Vitamin B12 Trafficking. J Biol Chem 290, 29167-29177, doi:10.1074/jbc.M115.683268 (2015). 406 Yamada, K., Gherasim, C., Banerjee, R. & Koutmos, M. Structure of Human B12 Trafficking Protein CblD Reveals Molecular Mimicry and Identifies a New Subfamily of Nitro-FMN Reductases. J Biol Chem 290, 29155-29166, doi:10.1074/jbc.M115.682435 (2015). 407 Barrett, J. Biochemistry of parasitic helminths. (MacMillan Publishers Ltd., 1981).

