Virosaurus

user manual Version 2.3 (14.07.2020)

Virosaurus 2020_04 files: https://viralzone.expasy.org/8676

Virosaurus (virus thesaurus) is a database of representative viral sequences covering known genetic diversity, curated to facilitate clinical metagenomics analysis.

Quickstart

The best way to use the database is to have all reads mapped to Virosaurus sequences, then to group the sequences that have the same usual name. The result should be a short list of species, with extra data to assess the quality of detection, such as number of reads, coverage and sequencing depth.
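The three quality indicators named above can be computed from the positions where reads align on one Virosaurus sequence. The following is a minimal sketch, not part of Virosaurus itself. It assumes the usual definitions (coverage as the fraction of reference positions covered by at least one read, depth as the mean number of reads per position) and a hypothetical input format of 0-based, half-open (start, end) intervals.

```python
def coverage_and_depth(ref_length, alignments):
    """Summarise read support for one reference sequence.

    ref_length: length of the reference sequence in bases.
    alignments: list of (start, end) intervals, 0-based and half-open
                (hypothetical input format, one interval per mapped read).
    """
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")

    # Per-position read depth along the reference.
    depth = [0] * ref_length
    for start, end in alignments:
        # Clip intervals to the reference boundaries.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1

    covered = sum(1 for d in depth if d > 0)
    return {
        "reads": len(alignments),
        "coverage": covered / ref_length,      # fraction of positions with >= 1 read
        "mean_depth": sum(depth) / ref_length  # average reads per position
    }


# Toy example: a 10-base reference and two overlapping reads.
stats = coverage_and_depth(10, [(0, 4), (2, 6)])
print(stats)
```

In practice these values would normally come from the mapper or from an alignment-processing tool; the sketch only makes the definitions explicit.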

Database contents

Virosaurus contains full-length genomes (monopartite genomes) or segments (segmented genomes) for all virus families comprising at least one species that infects vertebrates. Virosaurus covers all genetic diversity available in GenBank. All available sequences were clustered at 90% identity to remove redundancy in Virosaurus90 (23,615 FASTAs), or clustered at 98% identity in Virosaurus98 (73,160 FASTAs). Several clusters can belong to the same virus species. This happens for highly variable viruses like Lassa: there are 100 Lassa clusters in Virosaurus90 and 637 in Virosaurus98.

The FASTA headers have been annotated with metadata to facilitate metagenomic analysis. For instance, the viral nucleic acid is annotated as RNA, DNA or RNA/DNA, which helps in interpreting results from sequencing of either molecule.

In the Virosaurus release 2020_04, Herpesviridae and Poxviridae sequences are split into genes rather than full genomes. This allows incomplete genome sequences to be used, and helps to mitigate the low number of complete genomes relative to the high variability of those families.

Licence: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

FASTA format

Annotations are stored in the FASTA header. The header contains 11 different topics annotated with a controlled vocabulary. Data come from GenBank, ICTV, ViralZone and manual curation.

FASTA header:

: ; usual name=; clinical level=; clinical typing=; species=; taxid=; acronym=; nucleic acid=; circular=; segment=;

Sequence ID: By default, the GenBank accession number of the sequence displayed in the FASTA. If the displayed sequence is a portion of a GenBank entry, for example in the case of genes, the Sequence ID is a unique identifier like GENE_583-3988.

Usual name= Name of the clinical-level entity. If the scientific name is not commonly used, the common clinical name replaces the official species name; for example, parvovirus B19 is the usual name of the species Primate erythroparvovirus 1. If clinical level=genus, the usual name is the genus name or acronym; for example, the usual name of all Alphatorquevirus is TTV.

Clinical level= Gives the taxonomic level suggested to be relevant for usual clinical diagnostics. By default species, but it can be another level, as for TTVs or HPVs.

Clinical typing= Unknown by default. Otherwise contains clinically relevant data below species level. This can be genotypes (for example, HCV) or qualifiers (enterovirus, high-risk HPV, etc.). In the rare cases of mixed clusters, several typings are listed, separated by commas. This notably happens for some HPVs, where “low risk” and “undetermined risk” types can be mixed in one cluster.

Species= Indicates the current official species name, as reported by the International Committee on Taxonomy of Viruses (ICTV, 07_19): https://talk.ictvonline.org/taxonomy/ . In the rare cases where a cluster comprises more than one species, these are listed, separated by commas. This notably happens for some segments of A and C which are very similar within different species.

Taxid= Identifier of the taxonomic entity at species level in the NCBI Taxonomy database: https://www.ncbi.nlm.nih.gov/taxonomy

Acronym= Official acronym of the species, as reported in the ViralZone acronym list: https://viralzone.expasy.org/resources/Acronyms.xlsx

Nucleic acid= Nature of the viral genome: either RNA or DNA, or RNA/DNA for retro-transcribing viruses (Ortervirales).

Circular= Y or N for yes or no. This is essential for efficiently mapping reads at both ends of the FASTA sequence.

Segment= N/A for monopartite viruses. For segmented genomes: official segment name as reported in ViralZone database: https://viralzone.expasy.org/
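The field descriptions above can be turned into a small header parser. The following is a minimal sketch, assuming that fields are separated by semicolons, that each annotation has the form key=value, and that the first token is the Sequence ID. The function names and the example header (including its accession, taxid and acronym values) are hypothetical placeholders, not taken from a Virosaurus file.

```python
def parse_virosaurus_header(header):
    """Split a Virosaurus-style FASTA header into a dict of fields.

    Assumes ';'-separated tokens, the first being the sequence ID and
    the others being 'key=value' annotations.
    """
    text = header.lstrip(">").strip()
    parts = [p.strip() for p in text.split(";") if p.strip()]
    if not parts:
        raise ValueError("empty FASTA header")

    fields = {"sequence id": parts[0]}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # ignore tokens that are not key=value
            fields[key.strip().lower()] = value.strip()
    return fields


def split_multi(value):
    """Split a comma-separated field (mixed typings or species) into a list."""
    return [v.strip() for v in value.split(",") if v.strip()]


# Hypothetical header: accession, taxid and acronym are placeholders.
example = (">XX000000; usual name=Parvovirus B19; clinical level=species; "
           "clinical typing=unknown; species=Primate erythroparvovirus 1; "
           "taxid=0; acronym=XXX; nucleic acid=DNA; circular=N; segment=N/A;")

fields = parse_virosaurus_header(example)
print(fields["usual name"])                          # Parvovirus B19
print(fields["circular"] == "Y")                     # False
print(split_multi("low risk, undetermined risk"))
```

Keeping multi-valued fields as raw strings and splitting them on demand avoids guessing which fields may contain commas.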

Creating a report using Virosaurus data

Virosaurus is clustered to lower the redundancy of sequences, which is rather high for HIV-1 and viruses. Each Virosaurus entry is a representative sequence from a cluster. Clusters can comprise between 1 and 20,543 sequences.

The Virosaurus FASTA header is designed to simplify clinical metagenomics reporting by gathering reads under each viral species. The concept is to report reads associated with an entity, rather than with the individual sequences that each represent a cluster.

Figure 1: Example of reads grouped together under a species. Here, the genetic diversity of this virus is represented by 10 sequences in Virosaurus, representing 10 clusters of similar sequences. All clinical reads assigned to the “Human polyomavirus 2” FASTAs can be added together, resulting in a total of 28 reads for HPyV-2. Doing so makes it easier to check for the presence of viruses without having to look at a long list of similar viruses.
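The grouping shown in Figure 1 can be sketched as a sum of per-sequence read counts keyed on one header field. This is an illustrative sketch only: the header layout (semicolon-separated key=value pairs), the sequence identifiers and the per-sequence read counts are assumptions made up for the example, and only three sequences are shown rather than the ten of the figure.

```python
from collections import defaultdict


def header_field(header, name):
    """Return the value of one 'key=value' field from a ';'-separated header."""
    for part in header.lstrip(">").split(";"):
        key, sep, value = part.partition("=")
        if sep and key.strip().lower() == name:
            return value.strip()
    return None


def reads_per_entity(read_counts, headers, field="usual name"):
    """Sum read counts over all sequences sharing the same header field.

    read_counts: {sequence_id: number of mapped reads}
    headers:     {sequence_id: full FASTA header}
    """
    totals = defaultdict(int)
    for seq_id, n_reads in read_counts.items():
        entity = header_field(headers[seq_id], field) or seq_id
        totals[entity] += n_reads
    return dict(totals)


# Hypothetical identifiers and counts, for illustration only.
headers = {
    "SEQ_A": ">SEQ_A; usual name=Human polyomavirus 2; circular=Y;",
    "SEQ_B": ">SEQ_B; usual name=Human polyomavirus 2; circular=Y;",
    "SEQ_C": ">SEQ_C; usual name=Human polyomavirus 2; circular=Y;",
}
read_counts = {"SEQ_A": 20, "SEQ_B": 5, "SEQ_C": 3}

print(reads_per_entity(read_counts, headers))
```

Sequences whose header lacks the chosen field fall back to their own identifier, so no reads are silently dropped from the report.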

Authors:

Anne Gleizes, Florian Laubscher, Nicolas Guex, Christian Iseli, Thomas Junier, Samuel Cordey, Jacques Fellay, Ioannis Xenarios, Laurent Kaiser and Philippe Le Mercier.

Credits:

Virosaurus has been developed by a collaboration between SIB Swiss Institute of Bioinformatics (Vital-IT and Swiss-Prot groups), Université de Genève and Hôpitaux Universitaires de Genève. The development of Virosaurus is supported by the Swiss National Science Foundation (grant 310030_189179).