Input Data Aim of Virmap
Total Page:16
File Type:pdf, Size:1020Kb
Input Data • Next-generation sequencing data • Single-end or paired-end • Mix of viruses, host, bacteria and phage DNA and RNA • Capture sequencing to enrich for viruses Aim of VirMap • Identify reads that map to known viral sequences • De novo sequence assembly and recover putative novel virus sequences • Assign each viral sequence to species (e.g. identify the taxonomy) Instructions for Installation of VirMap on Gadi • Lack of installation instructions (early 2019). Enlisted help from ResTech • Instructions from ResTech: • https://github.com/rsoftone/virmap • Download VirMap pipeline: https://github.com/cmmr/virmap.git Miniconda • Conda modules • (e.g. bbmap, diamond, megahit) • Perl • RocksDB – taxonomy lookup • Recommend not to install in default $TMPDIR (/scratch/projectID/userID/tmp). Please update $TMPDIR with some other location (e.g. export TMPDIR= /g/data/projectID/userID/tmp). Testing VirMap Installation Part 1: Taxonomy database generation • Run script to generate the taxaJson.dat file • The name is misleading - it is not JSON but actually Sereal- encoded, Zstd compressed Perl hash reference built up from NCBI taxonomy files. Code highlights • Download taxonomy data from NCBI • Construct edges to child-nodes from taxonomy data • Sereal encoding + Zstd compression of Perl hash-reference • https://github.com/rsoftone/virmap/tree/master/10- helper-scripts Part 2: Accession -> GI lookup • Downloads a table that converts NCBI accession to GI accession • NCBI accession: NM_000202.5 • GI accession: 262118207 FASTA format header: • blue = fields that are contained in the NCBI Genbank feature table. • yellow = not currently in Genbank’s feature table and need to be obtained via Accession -> GI lookup • Information obtained by correspondence with VirMap authors. Nucleotide: • >gi|2765944|gb|Z99337.1|.Human.immunodeficiency.virus.1.isolate.C49.DNA.for.reverse;taxId=11676 Protein: • >GI|GI:817036220|KR080198.1|AKF14503.1|GI:817036221|hypothetical.protein|Mycobacterium.phage.Ca mbiare;pos=46..537;codonStart=1;taxId=1647305 https://github.com/rsoftone/virmap/tree/master/10-helper-scripts Part 3: Genbank division download • https://github.com/rsoftone/virmap/tree/master/referencedbs • Download aspera fast data transfer (ascp) utility • List files served by GenBank FTP server • Downloading required GenBank divisions • The GenBank database is divided into 18 divisions. For example: • PRI - primate sequences • VRT - other vertebrate sequences • PLN - plant, fungal, and algal sequences • BCT - bacterial sequences • VRL - viral sequences Other Steps for Setting the Database • Part 4: Nucleotide .fasta generation using Gadi PBS script • Part 5: Protein .fasta generation using Gadi PBS script • Part 6: Build BBmap virus database (reads mapping) • Part 7: Build Diamond virus database (protein sequence DB search) • Part 7.5: Build Diamond genbank database (gbBlastx) • Part 8: Build Kraken2 database – find lowest common ancestor (LCA) of all genomes containing the short sequence (k-mer), assign taxonomy to sequence • Part 9: Build blastn genbank database (gbBlastn) Configure virmap_config.sh in the ## Under $TMPDIR/virmap/ virmap_config.sh ## File path, used as the value for the --gbBlastx parameter GB_BLASTX="/g/data1a/projectID/VirMap/200625-gbblastx.dmnd" ## Partial file path, used as the value for the --gbBlastn parameter GB_BLASTN="/g/data1a/projectID/VirMap/191205-gbBlastn/191205-gbBlastn" ## Folder path, used as the value for the --virBbmap parameter VIR_BBMAP="/g/data1a/projectID/VirMap/191205-virBbmap" ## File path, used as the value for the --virDmnd parameter VIR_DMND="/g/data1a/projectID/VirMap/200625-virdiamond.dmnd" ## File path, used as the value for the --taxaJson parameter TAXA_JSON="/g/data1a/projectID/VirMap/200117-taxaJson.dat" Run VirMap Wrapper Script (part 1) # Activate Conda Environment INSTALL_DIR=$TMPDIR/virmap MINICONDA_DIR=$TMPDIR/miniconda3 export TMPDIR= /g/data/projectID/userID/tmp mkdir -p/path/to/store/output cd /path/to/store/output cp $INSTALL_DIR/activate.sh /path/to/store/output source activate.sh Run VirMap Wrapper Script (part 2) qsub -P projectID -q normal -l walltime=6:00:00,\ mem=48G,ncpus=12,wd,jobfs=100GB,storage=scratch/projectID +gdata/projectID -joe -- \ $INSTALL_DIR/virmap_wrapper.sh \ --readUnpaired "/path/to/my/sample/.fastq" \ --sampleName "sample_name" VirMap Output • Final Output file (e.g. nCoV_228_S83_L001_R1_001.final.fa) • Fasta file contained list of all assembled virus genomes and taxonomy information, if available. • Example fasta header: >gi|1369125429|gb|MG772934.1|Bat.SARS-like.coronavirus.isolate.bat-SL- CoVZXC21,.complete.genome.,...;taxId=1508227;length=29566;size=53956; maxScore=9709;maxAlign=9964;maxPossibleScore=15495;protScore=9709; maxProtAlign=9964;nuclScore=0;maxNuclAlign=0;letterOccupancy=71.54;hi ghDivergence=1 Aggregate Results • Give the script a list of results directories and it will aggregate data from running many samples into one Excel file • Whether the run is successful and returned usable results • For each sample • List of viruses, listed by NCBI taxonomy ID • The taxonomy classification hierarchy • De-duplicate read counts https://github.com/rsoftone/virmap/tree/master/aggreg_stats Aggregate number of errors for each sample Walltimes for running each step for individual sample Duration for Each Step 12 cores and 48 Gb Ram 24 cores and 112 Gb RAM Run Statistics • 94 samples from 93 individuals • Two of the 94 samples failed our first attempt at running VirMap • 8 samples failed de-duplication but still produced results • The total amount of SU used by the 92 samples is 5,894 SU • median of 62.5 SU ±2.5 SU standard error (se) per sample • Walltime: • 1:37 ±4 min se. for 12 cores and 48 Gb RAM, and • 1:22 ± 2.3 min. se for 24 cores and 112 Gb RAM Get de-duplicated read counts • In a table format. Columns include: • NCBI tax id (e.g. 1508227) • Size = number of deduplicated reads (e.g. 408) • Weak or strong match, whether the sequence can be assigned to a taxa, potential for mis-annotation etc… • Taxonomy lineage in text. For example: • Viruses -> Riboviria -> Nidovirales -> Cornidovirineae -> Coronaviridae -> Orthocoronavirinae -> Betacoronavirus -> Sarbecovirus -> Severe acute respiratory syndrome-related coronavirus -> Bat SARS-like coronavirus Heat map of 94 samples versus viruses identified – prelim. results Coronaviruses Summary Results • 63/92 (68%) samples had coronavirus of some sort (2019 reference database) as detected by VirMap • 81/92 (88%) samples had other viruses that are not coronaviruses or phage. • 55/92 (59.8%) samples had co-infection of Coronavirus and other viruses. • Endogenous retrovirus K, found, in 89/92 samples, not Coronaviruses shown in heat map Re-map Reads to Viral SeQuences to get Read Counts • Get read counts for each virus • Remove all reads that mapped to • The human genome • Bacterial Genbank division genomes • Bacteriophage Genbank division genomes • Map all remaining reads to the *.final.fa virus genomes assembled from the sample • Remove multi-mapped reads • Get read counts • Usage: ./get_read_counts.sh -g human_genome_bwa -p phage_bwa -b bacteria_bwa virmap_output.final.fa virmap_input_sequence.fq • https://github.com/rsoftone/virmap/blob/master/get_read_counts.sh Availability on Other Platforms • Katana • Duncan has attempted to install VirMap • Differences in CentOS version (Katana v6, Gadi v7) • Conda dependencies in Katana cannot be resolved • Microsoft Azure • VirMap will install in an Ubuntu instance • Need to transfer data into blob storage and transfer output back into blob storage • Around $300 test Credits for UNSW staff and students • Need to submit request for higher capacity machine, contact support. • Remember to switch off the instance when you don’t need it • Amazon • Original VirMap Github currently has instructions on how to install pipeline and reference DB in Amazon. (Was not available before in 2019.) Future Works • Comparison to other tools (e.g. IDSeq and Genome Detective) • Analyze more samples wit VirMap • Update sequence database to 2020 version on Gadi (for COVID19 analysis) • Possible developments: • Option to replace for BlastN with a faster tool MMseqs2 (https://github.com/soedinglab/MMseqs2) • Re-write parts of VirMap in a workflow scripting language (e.g. Snakemake or Nextflow) Acknowledgements UNSW ResTech • Virology Research Laboratory: • Prof. Maria Craig. • Prof. William Rawlinson • Sonia Isaacs (PhD) • Emily Gibson & Cynthia Lau (RA) • Clare Faulkner, Dylan Foskett, Jessica Lau (Students) • Coordinators: Jacki, Deb, Susan, Kelly, Jess • UNSW: • Prof. Marc Wilkins • Dr. Ignatius Pang • A/Prof. Rowena Bull • A/Prof. Fabio Luciani • UNSW ResTech (Simon, David, Martin, Duncan, Robin) • Center for Infection & Immunity: • Prof. Ian Lipkin • A/Prof. Thomas Briese • ENDIA Study: • Prof. Jennifer Couper (CI) • All investigators • Study coordinators • Study team • Microsoft (Andreas Wilm & Auda Eltahla) • Mothers & Children supporting ENDIA • National Collaborative Research Infrastructure Strategy (NCRIS) and Bioplatforms Australia Serology & Taxonomy data from NCBI Notes from Simon. Downloading and unzipping should produce the files: names.dmp nodes.dmp You'll need the following fields from each file, based on the VirMap krakenFilter.pl, which imports the following columns ($parents, $names, $ranks, $children) nodes.dmp tax_id -- node id in GenBank taxonomy database. The ‘children’ needs to be derived from reverse of ‘parents’, so some code is needed to perform this. parent tax_id -- parent node id in GenBank taxonomy database (I think this corresponds to the 'parents' field) rank -- rank of this node (superkingdom, kingdom, ... (I think this corresponds to the 'ranks' field) names.dmp tax_id -- the id of node associated with this name (use this field to merge with the nodes.dmp file) name_txt -- name itself (I think this corresponds to the ‘names’ field)..