A Comprehensive Resource for Helminth Genomics

G Model MOLBIO-11030; No. of Pages 9 ARTICLE IN PRESS Molecular & Biochemical Parasitology xxx (2016) xxx–xxx Contents lists available at ScienceDirect Molecular & Biochemical Parasitology − WormBase ParaSite a comprehensive resource for helminth genomics a,∗ a b a b Kevin L. Howe , Bruce J. Bolt , Myriam Shafie , Paul Kersey , Matthew Berriman a European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK b Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK a r t i c l e i n f o a b s t r a c t Article history: The number of publicly available parasitic worm genome sequences has increased dramatically in the Received 7 October 2016 past three years, and research interest in helminth functional genomics is now quickly gathering pace in Received in revised form response to the foundation that has been laid by these collective efforts. A systematic approach to the 24 November 2016 organisation, curation, analysis and presentation of these data is clearly vital for maximising the util- Accepted 25 November 2016 ity of these data to researchers. We have developed a portal called WormBase ParaSite (http://parasite. Available online xxx wormbase.org) for interrogating helminth genomes on a large scale. Data from over 100 nematode and platyhelminth species are integrated, adding value by way of systematic and consistent functional annota- Keywords: tion (e.g. protein domains and Gene Ontology terms), gene expression analysis (e.g. alignment of life-stage Genome browser specific transcriptome data sets), and comparative analysis (e.g. orthologues and paralogues). We provide Comparative genomics Functional genomics several ways of exploring the data, including genome browsers, genome and gene summary pages, text Helminths search, sequence search, a query wizard, bulk downloads, and programmatic interfaces. In this review, WormBase we provide an overview of the back-end infrastructure and analysis behind WormBase ParaSite, and the Ensembl displays and tools available to users for interrogating helminth genomic data. © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). 1. Introduction agricultural importance and therefore attract research interest in their own right. The first parasitic nematode to have its genome The WormBase project (http://www.wormbase.org, [1]) was fully sequenced was Brugia malayi [3], and since then, many have initiated to facilitate and accelerate biological research that uses the followed [4–37]. Of the WormBase “core” species (the ones for model nematode Caenorhabditis elegans, by making the collected which the reference genome sequence and annotation are curated), outputs of scientists accessible from a single resource. This enables three are parasitic worms: Brugia malayi, Onchocerca volvulus and the transfer of this wealth of knowledge to the study of other meta- Strongyloides ratti (http://www.wormbase.org/species). However, zoa, from nematodes to humans. C. elegans data described in the there are a number of challenges associated with expanding the research literature, deposited in the archives, or submitted directly remit to helminths. Firstly, the primary research goal of parasitol- is placed into context via a combination of detailed manual cura- ogists is to identify ways of controlling the parasite, and as such tion and semi-automatic data integration. In addition, the project their desired entry points and common use-cases for WormBase curates the reference genome sequence, gene structures and other are often distinct from those of scientists doing basic science using genomic features for C. elegans, thereby providing a high quality C. elegans as a model. Secondly, the wider community of worm par- foundation for downstream studies. The WormBase mission also asitologists includes those studying platyhelminths (flatworms), extends to free-living relatives of C. elegans, which were the focus which are outside the taxonomic scope of WormBase, which is of early post-genomic research in the areas of evolution and com- a nematode resource. Thirdly, there have been recent concerted parative biology [2]. efforts to sequence the genomes of many helminths (nematodes In recent years, WormBase has begun to expand its mission and platyhelmiths) (e.g. the 50 Helminth Genomes Initiative [38]), to include plant and animal parasitic nematodes which, while resulting in a flood of new draft genomes that vary considerably in more distantly related to C. elegans, have direct biomedical and contiguity. In response to these challenges, we have created WormBase Par- aSite, a comprehensive new resource for parasitic worm genomes, ∗ aiming to serve parasitologists working on helminths who use Corresponding author. E-mail address: [email protected] (K.L. Howe). genomics as an investigative tool. WormBase ParaSite leverages the http://dx.doi.org/10.1016/j.molbiopara.2016.11.005 0166-6851/© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4. 0/). Please cite this article in press as: K.L. Howe, et al., WormBase ParaSite − a comprehensive resource for helminth genomics, Mol Biochem Parasitol (2016), http://dx.doi.org/10.1016/j.molbiopara.2016.11.005 G Model MOLBIO-11030; No. of Pages 9 ARTICLE IN PRESS 2 K.L. Howe et al. / Molecular & Biochemical Parasitology xxx (2016) xxx–xxx Table 1 infrastructure and expertise of the WormBase project, with a main Species with RNA-Seq expression tracks in WormBase ParaSite (release 7). Included mission of breadth (foundational data for many species) rather than are some unpublished studies that are publicly available via the European depth (rich data for a single species). Nucleotide Archive. Species Studies Samples (tracks) Nematode Brugia malayi 3 [58,59] 57 2. Data integration and analysis Onchocerca volvulus 2 [60,61] 19 Strongyloides ratti 2 [34] 8 2.1. Genomes Strongyloides stercoralis 2 [62] 42 Trichuris muris 1 [25] 22 Our mission is to include all publicly available nematode and Platyhelminth Echinococcus granulosus 2 [15] 6 platyhelminth genomes in WormBase ParaSite. Where multi- Echinococcus multicularis 2 [15] 35 Fasciola hepatica 2 [31] 17 ple genome assemblies exist for the same species (e.g. different Hymenolepis microstoma 2 [15] 17 genome projects for Haemonchus contortus [18,19], and male and Schistosoma mansoni 6 [63–67] 33 female isolates for Trichuris suis [24]), we have included all. Release 7 of the resource (August 2016) included genomes from 82 nematode species (98 genomes) and 28 platyhelminth species (30 2.2.2. Functional annotation genomes). We maintain an up-to-date list of genomes for ease of Literature describing gene function is sparse for most of the reference (http://parasite.wormbase.org/species.html). species in WormBase ParaSite, although we anticipate that the Our primary source for genome sequences is the International availability of the genomes will stimulate new research. In lieu of Nucleotide Sequence Database Collaboration (INSDC) resources annotations based on experimental evidence, we have used estab- [39]. In rare cases, we collect genomes from project-specific FTP lished automated methods to predict the function of as many gene sites or direct engagement with genome project scientists. How- products as possible. Firstly, we broker the submission of all protein ever, it is our strong preference to use genomes deposited with sequences in WormBase ParaSite to the UniProt Knowledgebase INSDC as these have been formally checked, processed and ver- [51], and import the product names assigned by them. These are sioned by an authoritative sequence archive resource. As well as defined by a combination of manual curation and automatic anno- allowing us to formally disambiguate between different genomes tation. For more detailed annotation, we use InterProScan [52] from for the same species, by way of the INSDC BioProject identifier the InterPro project [53]. As well as predicting protein domains (e.g. (http://www.ncbi.nlm.gov/bioproject), it also simplifies the pro- from the Pfam [54] database), it also assigns terms from the Gene cess of integrating and displaying functional genomics data that has Ontology (GO) [55]. We run the latest version of the pipeline and been submitted to the archives. For the species in common between data across the complete proteome for every genome each release, WormBase and WormBase ParaSite (C. elegans and other free-living so that we are always up-to-date with the latest InterPro domain nematodes, Brugia malayi, Strongyloides ratti and Onchocerca volvu- and Gene Ontology annotations. lus), we synchronise the data with a specific release of WormBase. For example, WormBase ParaSite release 7 was synchronised with 2.2.3. Gene expression WormBase release WS254, meaning that data for species in com- High-throughput sequencing of RNA has become the standard mon were identical in both resources. assay for measuring gene expression, and numerous studies con- ducting “RNA-Seq” experiments in helminth species have now been performed and deposited in the sequence archives. We ultimately 2.2. Genome annotation aim to align all helminth RNA-Seq data to the corresponding reference genome using a standard pipeline, in collaboration with the 2.2.1. Gene structures Gene Expression Atlas project [56] (see Section 4). In the meantime, The definition of the intron/exon structure of protein-coding we have processed data for a small number of species using our own genes is an important foundation for interpretation of genome pipeline, as a pilot study (Table 1). Briefly, we use STAR [57] to align function. WormBase curates gene structures for a C. elegans and a each experiment to the reference genome, merging resulting BAM small set of nematode species with high-quality reference genomes files when they are technical replicates of the same sample. We [40]. For others, we import annotations from the acknowledged have labelled each sample with the appropriate descriptive terms, authority for that genome (e.g.

A Comprehensive Resource for Helminth Genomics

Original Article Text Mining in the Biocuration Workflow: Applications for Literature Curation at Wormbase, Dictybase and TAIR

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

Bioinformatics Study of Lectins: New Classification and Prediction In

Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*

DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S

UC Davis UC Davis Previously Published Works

Methods in and Applications of the Sequencing of Short Non-Coding Rnas" (2013)

A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

Comparing Tools for Non-Coding RNA Multiple Sequence Alignment Based On

Six-Fold Speed-Up of Smith-Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors

1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3