Searching for latent viruses in human whole-genome sequencing data
Supervisors (Genotek): Valery Ilinsky, Alexander Rakitko, Nadezhda Pogodina
Students: Yura Orlov, Alisa Morshneva
Blood virome
DNA viruses
● Adenoviridae ● Baculoviridae ● Herpesviridae ● Marseilleviridae ● Myoviridae ● Polyomaviridae ● Papillomaviridae ● Poxviridae ● Siphoviridae ● Anelloviridae ● Inoviridae ● Microviridae ● Parvoviridae
Popgeorgiev et al. 2013
Shotgun sequencing
Sampling → DNA extraction (human, bacterial, and viral dsDNA)
Image source: thermofisher.com
Goals
Characterize viral representation in human WGS data and search for possible associations between viral load and genetic variants
Tasks
1. Explore literature data
2. Create a pipeline for WGS data analysis
3. Find open WGS databases
4. Test pipeline on different samples
5. Count viral load in testing data
6. Create a table of viral load for samples from 1000 Genomes
7. Perform GWAS
Data samples
1. One woman's uterine tissue sample from Genotek for creating and testing pipeline
2. WES and WGS samples from Genotek
3. Samples from 1000 Genomes: CRAM files
Population  Number of samples
CEU         101
TSI         104
IBS         108
GBR         106
FIN         105
5 populations from 1000 Genomes have been analyzed
Pipeline for the analysis of 1000 Genomes samples (bash script)
Alignment files (CRAM)
→ Extraction of unmapped reads
→ Split merged file to forward and reverse reads
→ Kraken2 alignment of unmapped reads to RefSeq databases
→ Parsing of Kraken2 report files
→ Viral load estimation
Extraction of decoy reads (EBV) → assessment of viral representation
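The steps above can be sketched in Python as a list of command lines plus a small report parser. This is a minimal illustration, not the project's bash script: the function names, the output file naming, the `chrEBV` contig name, and the choice of flag `-f 12` (read and mate both unmapped) are all assumptions, and the commands are only built, not executed.

```python
from pathlib import Path


def build_pipeline_commands(cram, reference, kraken_db, outdir, ebv_contig="chrEBV"):
    """Build (but do not run) the per-sample commands as argument lists.

    All file names and the EBV contig name are assumptions for illustration.
    """
    out = Path(outdir)
    stem = Path(cram).stem
    unmapped = out / f"{stem}.unmapped.bam"
    fq1 = out / f"{stem}_1.fastq"
    fq2 = out / f"{stem}_2.fastq"
    report = out / f"{stem}.kraken2.report"
    kraken_out = out / f"{stem}.kraken2.out"
    ebv = out / f"{stem}.ebv.bam"
    return [
        # Extract pairs where both the read and its mate are unmapped (4 + 8 = 12).
        ["samtools", "view", "-b", "-f", "12", "-T", reference,
         "-o", str(unmapped), cram],
        # Split the merged file into forward and reverse reads.
        # Assumes mates are adjacent in the BAM; otherwise collate first.
        ["samtools", "fastq", "-1", str(fq1), "-2", str(fq2),
         "-0", "/dev/null", "-s", "/dev/null", str(unmapped)],
        # Classify the unmapped read pairs with Kraken2.
        ["kraken2", "--db", kraken_db, "--paired", "--report", str(report),
         "--output", str(kraken_out), str(fq1), str(fq2)],
        # Extract reads aligned to the EBV decoy contig (needs an indexed CRAM).
        ["samtools", "view", "-b", "-T", reference, "-o", str(ebv), cram, ebv_contig],
    ]


def parse_kraken2_report(text):
    """Map taxon name -> reads in its clade from a standard Kraken2 report.

    Standard report columns (tab-separated): percent, clade reads,
    direct reads, rank code, NCBI taxid, indented scientific name.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        counts[fields[5].strip()] = int(fields[1])
    return counts


# Illustrative report line with made-up counts.
example_report = "  0.01\t120\t120\tS\t10376\t      Human gammaherpesvirus 4\n"
example_counts = parse_kraken2_report(example_report)
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` on a machine where Samtools and Kraken2 are installed.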
Tool label on the pipeline diagram: Samtools
Patient with HPV infection
Sample of a woman's uterine tissue that is rich in papillomavirus 16, which caused cervical cancer
Figures: Total diversity; Virus-only subset
Viral load estimation
Approaches:
1. Using coverage values (median or average)
2. Using number of mapped reads
3. Using number of reads classified by Kraken
(Slide annotation beside the list: Difficulty)
Viral representation in 1000 Genomes data
Group                            Number of samples
Total                            524
Human gammaherpesvirus positive  524
Human mastadenovirus positive    123
EBV load in 1000 Genomes data
Group Number of samples
Low EBV 305
High EBV 219
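The slides do not state which load measure or cutoff produced the two groups. As a sketch only, assuming load is normalized as viral reads per million total reads and that a single hypothetical threshold separates the groups, the split could look like this (function names and sample IDs are illustrative):

```python
def reads_per_million(viral_reads, total_reads):
    """Normalize a viral read count by sequencing depth (reads per million)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return viral_reads * 1_000_000 / total_reads


def split_by_load(loads, threshold):
    """Split samples into 'low' and 'high' groups.

    loads: dict mapping sample ID -> normalized load.
    threshold: hypothetical cutoff; samples at or above it count as 'high'.
    """
    groups = {"low": [], "high": []}
    for sample, load in sorted(loads.items()):
        groups["high" if load >= threshold else "low"].append(sample)
    return groups


# Made-up sample IDs and loads, for illustration only.
example_loads = {"sample_A": 1.0, "sample_B": 5.0, "sample_C": 3.0}
example_groups = split_by_load(example_loads, threshold=3.0)
```

The resulting group labels can then serve as the case/control phenotype for the association analysis.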
Figure labels: Low viral load; High viral load
Mastadenovirus load in 1000 Genomes data
Group                   Number of samples
Data with adenoviruses  123
Control                 401
GWAS analysis (EBV)
SNP         Gene                           P-value
rs6710977   LOC105373958 : Intron Variant  8.795e-7
rs11780074  None                           1.404e-6
rs10532235  LOC643339 : Intron Variant     1.472e-6
rs10808854  None                           2.363e-6
rs2033302   LOC102724691 : Intron Variant  2.396e-6
GWAS analysis (Mastadenoviruses)
SNP          Gene                                     P-value
rs28529415   None                                     6.983e-8
rs138432335  CDH4 : Intron Variant                    1.187e-7
rs112308716  None                                     3.254e-7
rs529717656  ROCK1P1 : Non Coding Transcript Variant  4.495e-7
rs12681923   LOC105377793 : Intron Variant            7.282e-7
Project overall results
1. Pipeline for working with WGS data was created
2. Viral load of EBV and mastadenoviruses was counted
3. GWAS analyses were performed for 5 populations from 1000 Genomes
Thank you for your attention!
Project GitHub link: https://github.com/Alisa1195/Searching-for-latent-viruses-in-human-whole-genome-sequencing-data