
Comparative genomics of the major parasitic worms

International Helminth Genomes Consortium

Supplementary Figures

Supplementary Fig. 1. Patterns of gene family sharing. (a) The numbers of gene families (pink bars; values on left-hand y-axis) and the numbers of genes in those families (blue bars; values on right-hand y-axis) with particular patterns of sharing between high-level groups in our Compara data. Shading in the lower panel from pink to blue represents how widespread each set of families is, with pink representing families specific to one group and dark blue those families present in all groups. (b) Scatterplot of gene family size against the number of species a family is present in, with each point representing a single gene family (families with fewer than 3 genes are excluded), and points coloured according to the number of higher-level taxonomic groups they are shared between, as in the lower part of panel (a).
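The tally behind panel (a) can be illustrated with a minimal sketch. It assumes each gene family is available as a list of (species, gene) pairs and that each species maps to one high-level group. The function name, the input layout and the toy data are hypothetical and are not taken from the Compara pipeline.

```python
from collections import defaultdict

def sharing_patterns(family_genes, species_group):
    """Tally families and genes by the set of groups in which each family occurs.

    family_genes: dict of family id -> list of (species, gene id) pairs
    species_group: dict of species -> high-level group name
    Returns a dict of frozenset(groups) -> (number of families, number of genes).
    """
    n_families = defaultdict(int)
    n_genes = defaultdict(int)
    for members in family_genes.values():
        # The sharing pattern is the set of groups represented in the family.
        pattern = frozenset(species_group[species] for species, _ in members)
        n_families[pattern] += 1
        n_genes[pattern] += len(members)
    return {p: (n_families[p], n_genes[p]) for p in n_families}

# Toy input (hypothetical species and gene identifiers).
toy_groups = {"sp_a": "Nematoda", "sp_b": "Nematoda", "sp_c": "Platyhelminthes"}
toy_families = {
    "fam1": [("sp_a", "g1"), ("sp_b", "g2")],
    "fam2": [("sp_a", "g3"), ("sp_c", "g4"), ("sp_c", "g5")],
    "fam3": [("sp_b", "g6")],
}
patterns = sharing_patterns(toy_families, toy_groups)
```

In the toy data, two families (three genes) are specific to one group and one family (three genes) is shared by both groups. These correspond to the paired bars for each sharing pattern in panel (a).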

Supplementary Fig. 2. Phylogenetic tree of ferrochelatases. Branches are coloured by membership of protein sequences to ferrochelatase Compara families. Bootstrap support of the main branches is indicated. FeCL (): nematode-specific non-functional ferrochelatase-like proteins, devoid of active site (family 740872); CladeIVa FeCL (nematode): synapomorphic non-functional ferrochelatase-like family of nematode clade IVa (family 1184543); FeCH (Non-nematode): functional ferrochelatase family composed of taxa of non-nematode phyla (family 850580 and part of family 787620); Alphaproteobacteria: ferrochelatases of Alphaproteobacteria (Rhizobiales); : ferrochelatases of Leadbetterella byssophila and Mucilaginibacter paludis; Gammaproteobacteria: ferrochelatases of Pseudomonas species; - like: ferrochelatases of Wolbachia, Ehrlichia and Hydrogenobacter species; and HGT-FeCH (nematode): clade III/IV-specific functional ferrochelatase acquired through horizontal gene transfer from Alphaproteobacteria (family 787620).

Supplementary Fig. 3. Expanded families involved in immunity and development. Expanded gene families with potential roles in immunity and development. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 4. Distribution and phylogeny of group I SCP/TAPS genes in strongylid . (a) The phylogeny of the group I clade expanded in strongylids (clades Va, Vb and Vc). Red dots show high bootstrap values (≥0.8). For visualisation purposes, the numbered nodes were collapsed in Fig. 3. Zooming into details of a branch reveals that there have been both species-specific expansions (e.g. node 2313) and more ancient expansions shared by several strongylid species (e.g. node 1618). (b) A heatmap showing the distribution of SCP/TAPS gene counts, where each row of the heatmap corresponds to a leaf of the tree in (a). Note that in some cases a leaf corresponds to a consensus sequence representing a group of genes, which belong to species from the same species group (Supplementary Information: Methods 23). (c) A heatmap showing the number of genes per species for each of the numbered nodes from (a).

[Plot for Supplementary Fig. 5. Rows: the species and outgroups analysed, from H.sapiens to P.pacificus. Series: Group 1 and Group 2. Axis: SCP/TAPS copies.]

Supplementary Fig. 5. Distribution of SCP/TAPS genes in each species. The number of length-filtered SCP/TAPS genes used as input for the phylogenetic analysis is shown (Supplementary Information: Methods 23). The vertical black lines show the median numbers of Group 1 and Group 2 genes per species, across all species.

Supplementary Fig. 6. Hypothetical proteins. (a) Histograms of log(family size) for the Compara gene families that lacked functional annotation (protein names; Supplementary Information: Results 2.4) and families with annotations (red and blue lines, respectively). Almost half of the Compara gene families (46.9%) lacked functional annotation. These families tended to be smaller than those with annotations (P<2.2e-16, Kolmogorov-Smirnov test). (b) The size distribution of hypothetical families shows that many were found in a large number of species, suggesting that they contained genuine genes of unknown function.

[Panels for Supplementary Fig. 7: a) Family 833705; b) Family 924596; c) Family 425424; d) Family 689000; e) Family 983993; f) Family 1036589; g) Family 393312; h) Family 747261; i) Family 849084; j) Family 219111 (oxamniquine response). Colour key for species groups as in Fig. 1.]

Supplementary Fig. 7. Striking expansions in families with poorly defined roles. Gene families with striking variation across species but uncharacterised (a-i), or poorly characterised (j) functions. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

[Panels for Supplementary Fig. 8: a) Family 3 (astacin); b) Family 163808 (astacin); c) Family 213188 (astacin); d) Family 89256 (astacin); e) Family 517261 (astacin); f) Family 132363 (astacin); g) Family 248860 (astacin); h) Family 38173 (); i) Family 80624 (cathepsin); j) Family 616925 (protease inhibitor, Kunitz-like); k) Family 358015 (alpha-macroglobulin); l) Family 426447 (trypsin inhibitor-like); m) Family 754923 (chymotrypsin/elastase inhibitor); n) Family 576282 (Kunitz-type inhibitor). Colour key: Outgroups; Platyhelminthes (Trematoda - Schistosomatids; Trematoda - Other; Cestoda; Platyhelminthes - Other); Nematoda (Vc - Ancylostomatidae+Strongylidae (AS); Vb - Lungworm; Va - other Strongylida; V - other; IVb - Tylenchomorpha; IVa - Strongyloididae + other; IIIc - Spiruromorpha + other; IIIb - Ascaridomorpha; IIIa - Oxyuridomorpha; I).]

Supplementary Fig. 8. Protease family expansions. Expanded gene families of proteases and protease inhibitors. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 9. Heatmap of ligand gated ion channels. Relative abundance profiles for all 99 LGICs represented in at least 3 of the 81 helminth species (Supplementary Information: Methods 19). 3 LGICs present in fewer than 3 species were omitted from the visualisation.

[Tree for Supplementary Fig. 10. Node labels are posterior probabilities, ranging from 0.582 to 1.000. Colour key (Phylum): Nematoda, Platyhelminthes, Outgroup.]

Supplementary Fig. 10. Inferred phylogeny of the cys-loop superfamily in platyhelminths and nematodes. Posterior probabilities were calculated from 8 reversible jump MCMC chains using MrBayes. The tree is rooted between nicotinic acetylcholine receptors and non-nicotinic anion channels. Posterior probabilities are displayed for nodes that correspond to our classification of cys-loop proteins.

Supplementary Fig. 11. Heatmap of ABC transporters. Relative abundance profiles for 50 ABC transporter classes (Supplementary Information: Methods 19).

Supplementary Fig. 12. Comparison of pathway coverage and variation. (a) Taxonomic differences in KEGG reference pathway coverage. For nematode clades, the comparison is with the union of the other five nematode groups. Platyhelminths, Clade IIIc- and Clade IVa- show the most consistent overall pattern. Comparisons with Wilcoxon test P-value <0.05 (FDR corrected using the Benjamini-Hochberg procedure) are considered significant. (b) Within-group variation in KEGG pathway coverage, and differences between groups. Only helminth-relevant KEGG metabolic pathways were considered (Supplementary Information: Methods 25). The y-axis is the coefficient of variation of the KEGG pathway coverage. For nematode clades, the comparison is with the union of the other five nematode groups. (c) Variation among nematodes and platyhelminths in KEGG pathway coverage aggregated according to superpathways. In panels (b) and (c), statistically significant comparisons are shown (red = lower; green = higher): * 0.01

[Panels for Supplementary Fig. 13 (labels for panels a to d were not recovered): e) Family 586945 (methylmalonate CoA epimerase); f) Family 476273 (alh-8); g) Family 136770 (LDH); h) Family 144732 (glutamate dehydrogenase); i) Family 150310 (acyl CoA synthetase); j) Family 167895 (lipase). Colour key: Outgroups; Platyhelminthes (Trematoda - Schistosomatids; Trematoda - Other; Cestoda; Platyhelminthes - Other); Nematoda (Vc - Ancylostomatidae+Strongylidae (AS); Vb - Lungworm; Va - other Strongylida; V - other; IVb - Tylenchomorpha; IVa - Strongyloididae + other; IIIc - Spiruromorpha + other; IIIb - Ascaridomorpha; IIIa - Oxyuridomorpha; I).]

Supplementary Fig. 13. Metabolism-related family expansions. Expanded Compara families with metabolism-related functions. Families were defined using Compara. For colour key and species labels, see Fig. 1. The plot for a family shows the gene count in each species, superimposed upon the species tree. A scale bar beside the plot for a family shows the minimum, median, and maximum gene count across the species, for that family.

Supplementary Fig. 14. Maximum likelihood phylogenies of two putative horizontal transfers. Phylogeny of helminth proteins, closely related sequences identified by BLAST and selected other sequences for (a) a clade IIIb-specific cobalamin-related family (CobQ/CbiP) gained from , and (b) an acetate/succinate transporter in clade I nematodes that appears to have been gained from Bacteria, and is likely to participate in acetate/succinate uptake or efflux.

Supplementary Fig. 15. Genome assembly pipelines. The individual pipelines used at (a) WTSI, (b) MGI and (c) BaNG. Details as defined in Methods.

[Flowcharts for Supplementary Fig. 16, one per panel: a) WTSI, b) MGI, c) BaNG. Stage labels that survive extraction: Pre-processing / Pipeline setup; Gene calling; High Confidence (HC) gene set generation; Final processing; Protein Annotation Pipeline (PAP); GenBank submission. Tools and resources named in the flowcharts: Repeat Modeler, Repeatmasker, CEGMA, SNAP, GeneMark, Augustus, MAKER2, EGAP, fgenesh, GenBLASTG, RATT, Rfam, RNAmmer, tRNAscan, Blastn, BlastX, UniProt, GeneDB, Swiss-Prot, CDD, KEGG / Interpro.]

Supplementary Fig. 16. Gene-finding pipelines. The individual pipelines used at (a) WTSI, (b) MGI and (c) BaNG. Details as defined in Methods.

Supplementary Fig. 17. Network representation of Compara families. Species are represented as nodes in the network and edges between two nodes are weighted by the number of times genes of both species appear together in a Compara family. Edges are coloured based on the colour of the phylum of the nodes they connect. Nodes are scaled based on relative proteome size for each species, coloured based on taxonomic membership and labelled according to the following list: 0 = Acanthocheilonema viteae, 1 = Amphimedon queenslandica, 2 = , 3 = , 4 = , 5 = Angiostrongylus cantonensis, 6 = Angiostrongylus costaricensis, 7 = simplex, 8 = , 9 = Ascaris suum, 10 = malayi, 11 = Brugia pahangi, 12 = , 13 = Bursaphelenchus xylophilus, 14 = Caenorhabditis elegans, 15 = Capitella teleta, 16 = Ciona intestinalis, 17 = , 18 = Crassostrea gigas, 19 = Cylicostephanus goldi, 20 = Danio rerio, 21 = Dictyocaulus viviparus, 22 = Dibothriocephalus latus, 23 = Dirofilaria immitis, 24 = , 25 = Drosophila melanogaster, 26 = , 27 = Echinococcus multilocularis, 28 = Echinostoma caproni, 29 = Elaeophora elaphi, 30 = Enterobius vermicularis, 31 = , 32 = Globodera pallida, 33 = pulchrum, 34 = Haemonchus contortus, 35 = Haemonchus placei, 36 = Heligmosomoides bakeri, 37 = Homo sapiens, 38 = Hydatigera taeniaeformis, 39 = , 40 = , 41 = , 42 = Ixodes scapularis, 43 = Litomosoides sigmodontis, 44 = , 45 = Meloidogyne hapla, 46 = Mesocestoides corti, 47 = , 48 = Nematostella vectensis, 49 = Nippostrongylus brasiliensis, 50 = Oesophagostomum dentatum, 51 = Onchocerca flexuosa, 52 = Onchocerca ochengi, 53 = , 54 = Panagrellus redivivus, 55 = , 56 = Parastrongyloides trichosuri, 57 = Pristionchus pacificus, 58 = Protopolystoma xenopodis, 59 = Rhabditophanes kr3021, 60 = Romanomermis culicivorax, 61 = Schistocephalus solidus, 62 = curassoni, 63 = , 64 = , 65 = , 66 = Schistosoma margrebowiei, 67 = Schistosoma mattheei, 68 = Schistosoma rodhaini, 69 = Schmidtea mediterranea, 70 = Soboliphyme baturini, 71 = Spirometra erinaceieuropaei, 72 = Strongyloides papillosus, 73 = Strongyloides ratti, 74 = Strongyloides stercoralis, 75 = Strongyloides venezuelensis, 76 = Strongylus vulgaris, 77 = Syphacia muris, 78 = , 79 = , 80 = Teladorsagia circumcincta, 81 = callipaeda, 82 = , 83 = Trichinella nativa, 84 = , 85 = regenti, 86 = Trichoplax adhaerens, 87 = Trichuris muris, 88 = Trichuris suis, 89 = , 90 = .
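The edge weighting described for Supplementary Fig. 17 can be sketched as a co-occurrence count. This sketch assumes one reading of the legend: an edge gains one unit of weight for every family that contains at least one gene from each of the two species. The function name, the input layout and the toy data are hypothetical. The legend does not specify whether multiple genes from one species in the same family count more than once, so that is left out here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(family_species):
    """Weight each species pair by the number of families containing both.

    family_species: dict of family id -> list of species names, one entry
    per gene (so a species may be repeated within a family).
    Returns a Counter keyed by alphabetically ordered (species, species) pairs.
    """
    weights = Counter()
    for species in family_species.values():
        # Assumption: a species is counted once per family, however many
        # genes it contributes to that family.
        for a, b in combinations(sorted(set(species)), 2):
            weights[(a, b)] += 1
    return weights

# Toy input (hypothetical species labels).
toy_family_species = {
    "fam1": ["A", "B", "B", "C"],
    "fam2": ["A", "B"],
    "fam3": ["C"],
}
edges = cooccurrence_edges(toy_family_species)
```

Here species A and B share two families, so their edge has weight 2. Pairs involving C have weight 1, and the single-species family adds no edge.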

Supplementary Fig. 18. Metabolic chokepoints. (a) Number of chokepoints before and after hole filling. (b) Clustering of the species based on presence and absence of the chokepoints. (c) Sharing between nematodes and platyhelminths, of chokepoints present in at least one species of the set (left), and chokepoints present in all species of the set (right).

Supplementary Fig. 19. Unique enzymes annotated and KEGG metabolic modules among the 81 nematode and platyhelminth species. (a) Counts of enzymes (unique EC identifiers) in the 81 platyhelminth and nematode species (i.e. present in any one of the set). (b) Distribution of conserved ECs across nematode and platyhelminth species (i.e. conserved across all species of the set). (c,d) Number of unique ECs annotated and KEGG modules deemed complete in all 81 species (in c) and the tier 1 species (those with high-quality assemblies; Supplementary Information: Methods 7) (in d), along with their proteome size. (e) Clustering of all 81 species based on KEGG module presence (Jaccard similarity index and Ward’s linkage). Bootstrap support values are indicated.
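The similarity measure named in panel (e) can be illustrated with a short sketch. It covers only the Jaccard index between the sets of modules present in two species. The Ward's linkage step and the bootstrap are not reproduced. The function names and the module labels (M1 to M5) are hypothetical placeholders, not real KEGG identifiers.

```python
def jaccard(a, b):
    """Jaccard similarity index of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)

def pairwise_jaccard(modules_by_species):
    """Return the Jaccard index for every unordered pair of species."""
    names = sorted(modules_by_species)
    return {
        (x, y): jaccard(modules_by_species[x], modules_by_species[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }

# Toy input (hypothetical species and module labels).
toy_modules = {
    "sp_a": {"M1", "M2", "M3"},
    "sp_b": {"M2", "M3", "M4"},
    "sp_c": {"M5"},
}
similarities = pairwise_jaccard(toy_modules)
```

A hierarchical clustering routine normally takes distances rather than similarities. One minus the Jaccard index is the usual conversion.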

Supplementary Fig. 20. The mitochondrial gene order and phylogeny for nematode species. Elaeophora elaphi, Globodera pallida, Meloidogyne hapla, and Soboliphyme baturini were excluded from the analysis because of insufficient mitochondrial genome data. Inverted sequences are shown by gene boxes with inverted text. The maximum-likelihood tree (left) was constructed using 12 mitochondrial proteins. Asterisks indicate that the assembly contains small gaps. The scale bar shows the number of amino acid substitutions per site.

Supplementary Fig. 21. The mitochondrial gene order and phylogeny for trematode species. Schistosoma rodhaini was excluded because of insufficient mitochondrial genome data. Inverted sequences are shown by gene boxes with inverted text. The maximum-likelihood tree (left) was constructed using 12 mitochondrial proteins. The scale bar shows the number of amino acid substitutions per site.

Supplementary Fig. 22. The mitochondrial gene order and phylogeny for cestode species. The platyhelminth Schmidtea mediterranea was used as an outgroup. Protopolystoma xenopodis was excluded because of insufficient mitochondrial genome data. Inverted sequences are shown by gene boxes with inverted text. The maximum-likelihood tree (left) was constructed using 12 mitochondrial proteins. The scale bar shows the number of amino acid substitutions per site.

Supplementary Fig. 23. Cazymes heatmap. Relative abundance profiles for 81 Cazyme families represented in at least 3 of the 81 helminth species. 25 families present in fewer than 3 species were omitted from the visualisation. GH = Glycoside hydrolases, PL = Polysaccharide Lyases, GT = Glycosyltransferases, CBM = Carbohydrate-Binding Modules.

[Panel for Supplementary Fig. 24: a) Family 437828 (XPG). Colour key: Outgroups; Platyhelminthes (Trematoda - Schistosomatids; Trematoda - Other; Cestoda; Platyhelminthes - Other); Nematoda (Vc - Ancylostomatidae+Strongylidae (AS); Vb - Lungworm; Va - other Strongylida; V - other; IVb - Tylenchomorpha; IVa - Strongyloididae + other; IIIc - Spiruromorpha + other; IIIb - Ascaridomorpha; IIIa - Oxyuridomorpha; I).]

Supplementary Fig. 24. Example of inexplicable expansion. Expansion of the Compara XPG gene family. The plot shows the gene count in each species, superimposed upon our species tree. A scale bar beside the plot shows the minimum, median, and maximum gene count across the species, for that family.

[Heatmap for Supplementary Fig. 25. Rows: the species and outgroups analysed, from P.pacificus to H.sapiens. Columns: the domain combinations PF00188; PF00188.PF00188; PF00188.PF00188.PF00188; PF00188.PF01549; PF00188.PF01549.PF01549; plus the 'included' and 'excluded' columns. Colour scales: counts. Species group key: other clade V, Vb, Va, Vc, IIIa, IIIc, IIIb, IVa, IVb, I, tapeworms, other platyhelminths, schistosomatids, other trematodes, outgroup.]

Supplementary Fig. 25. Domain composition of SCP/TAPS genes. A domain combination is included in this matrix if the combination can be detected in more than five sequences amongst all species. The column labelled ‘excluded’ gives the numbers of genes discarded from the analysis, based on their length (Supplementary Information: Methods 23). The ‘included’ column gives the remaining numbers of length-filtered SCP/TAPS genes in each species.
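The inclusion rule for domain combinations can be written as a simple frequency filter. The sketch below assumes each gene is described by an ordered tuple of Pfam accessions and that a combination is labelled by joining the accessions with dots, as in the figure's column labels. The function name, the parameter name and the toy genes are hypothetical.

```python
from collections import Counter

def frequent_combinations(gene_domains, min_sequences=5):
    """Keep domain combinations found in more than min_sequences genes.

    gene_domains: dict of gene id -> tuple of Pfam accessions in order.
    Returns a dict of dot-joined combination label -> number of genes.
    """
    counts = Counter(".".join(domains) for domains in gene_domains.values())
    # Strictly "more than five" by default, following the legend.
    return {combo: n for combo, n in counts.items() if n > min_sequences}

# Toy input: six single-domain genes and five two-domain genes (hypothetical ids).
toy_genes = {f"g{i}": ("PF00188",) for i in range(6)}
toy_genes.update({f"h{i}": ("PF00188", "PF01549") for i in range(5)})
kept = frequent_combinations(toy_genes)
```

With the toy data, the single-domain combination (six genes) passes the filter. The two-domain combination occurs in exactly five genes and is dropped, because the threshold is strict.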

Supplementary Fig. 26. Kinases heatmap. Relative abundance profiles for all kinase genes, and for 343 individual kinase Compara gene families represented in at least 5 of the 81 helminth species (Supplementary Information: Methods 19). 689 Compara families present in fewer than 5 species and an additional 43 unclassified kinase families were omitted from the visualisation. ‘Total Directly Annotated Kinases’ represents the relative abundance of kinases that were directly annotated per species, while ‘Per Kinase Compara Family’ represents the relative abundance of genes per annotated kinase family.