Searching for latent viruses in human whole-genome sequencing data

Supervisors (Genotek): Valery Ilinsky, Alexander Rakitko, Nadezhda Pogodina
Students: Yura Orlov, Alisa Morshneva

Blood virome

DNA viruses

● Marseilleviridae
● Inoviridae
● Microviridae

Popgeorgiev et al. 2013

Shotgun sequencing

Sampling → DNA extraction (human dsDNA, bacterial dsDNA, viral dsDNA)

Image source: thermofisher.com

Goals

Characterize viral representation in human WGS data and search for possible associations between viral load and genetic variation
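
As a rough illustration of the second goal, a single-variant association test between two viral-load groups could look like the sketch below. It assumes a simple allelic chi-square test on a 2x2 table of allele counts; the function name `allelic_chi2` and all counts are hypothetical, and this is not necessarily the test used in the project.

```python
import math


def allelic_chi2(high_alt, high_ref, low_alt, low_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are the two viral-load groups, columns are alternate/reference alleles.
    Returns (chi-square statistic, p-value).
    """
    table = [[high_alt, high_ref], [low_alt, low_ref]]
    row_sums = [sum(row) for row in table]
    col_sums = [high_alt + low_alt, high_ref + low_ref]
    total = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every row and column needs at least one allele")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for one SNP (not project data).
chi2, p = allelic_chi2(high_alt=60, high_ref=140, low_alt=30, low_ref=170)
print(f"chi2={chi2:.3f}, p={p:.2e}")
```

A real GWAS would repeat such a test for every variant and correct for multiple testing, usually with a dedicated tool rather than hand-written code.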

Tasks

1. Review the literature
2. Create a pipeline for WGS data analysis
3. Find open WGS databases
4. Test the pipeline on different samples
5. Estimate viral load in the test data
6. Create a table of viral load for samples from 1000 Genomes
7. Perform GWAS

Data samples

1. One woman's uterine tissue sample from Genotek, for creating and testing the pipeline
2. WES and WGS samples from Genotek
3. Samples from 1000 Genomes: CRAM files

Population    Number of samples
CEU           101
TSI           104
IBS           108
GBR           106
FIN           105

5 populations from 1000 Genomes have been analyzed.

Pipeline for the analysis of 1000 Genomes samples (bash script)

Steps:
1. Extraction of unmapped reads from alignment files (CRAM)
2. Splitting of the merged file into forward and reverse reads
3. Kraken2 alignment of unmapped reads to RefSeq databases
4. Parsing of Kraken2 report files
5. Viral load estimation

Additional steps:
● Extraction of decoy reads (EBV)
● Assessment of viral representation
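
The report-parsing step can be sketched as follows. This is a minimal example, assuming the standard six-column, tab-separated Kraken2 report layout (percentage, clade reads, directly assigned reads, rank code, NCBI taxid, indented name). The helper names, the example report lines and the per-million normalization are illustrative assumptions, not the project's actual script.

```python
def parse_kraken2_report(lines):
    """Parse Kraken2 report lines into a list of dictionaries."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        records.append({
            "percent": float(fields[0]),
            "clade_reads": int(fields[1]),   # reads in this taxon and its descendants
            "direct_reads": int(fields[2]),  # reads assigned directly to this taxon
            "rank": fields[3],
            "taxid": int(fields[4]),
            "name": fields[5].strip(),       # names are indented by taxonomy depth
        })
    return records


def reads_per_million(records, taxon_name, total_reads):
    """Clade reads of one taxon per million reads of the whole sample.

    total_reads is assumed to be the read count of the full sample,
    not only of the unmapped reads that were passed to Kraken2.
    """
    for rec in records:
        if rec["name"] == taxon_name:
            return rec["clade_reads"] * 1_000_000 / total_reads
    return 0.0


# Made-up report lines for illustration only.
example_report = [
    " 98.00\t980\t980\tU\t0\tunclassified\n",
    "  2.00\t20\t0\tR\t1\troot\n",
    "  1.50\t15\t0\tD\t10239\t  Viruses\n",
    "  1.20\t12\t12\tS\t10376\t    Human gammaherpesvirus 4\n",
]

records = parse_kraken2_report(example_report)
ebv_rpm = reads_per_million(records, "Human gammaherpesvirus 4", total_reads=2_000_000)
print(ebv_rpm)
```

Normalizing by the total read count is one way to make samples of different sequencing depth comparable; raw classified-read counts could be used instead.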

Tools: Samtools

Patient with HPV infection

Sample of a woman's uterine tissue that is rich in human papillomavirus 16, which caused cervical cancer.

[Figures: total diversity; virus-only subset]

Viral load estimation

Approaches (ordered by difficulty):

1. Using coverage values (median or average)
2. Using the number of mapped reads
3. Using the number of reads classified by Kraken2

Viral representation in 1000 Genomes data

Group                              Number of samples
Total                              524
Human gammaherpesvirus positive    524
Human mastadenovirus positive      123

EBV load in 1000 Genomes data

Group       Number of samples
Low EBV     305
High EBV    219
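
How the low/high cut-off was chosen is not stated here. The sketch below assumes a simple fixed threshold on a per-sample load value; the function name, sample names, load values and threshold are all hypothetical.

```python
def split_by_load(loads, threshold):
    """Split samples into low and high groups by a fixed load threshold.

    loads maps sample name -> viral load value (any consistent unit).
    Samples with load >= threshold go to the high group.
    """
    low = sorted(name for name, value in loads.items() if value < threshold)
    high = sorted(name for name, value in loads.items() if value >= threshold)
    return low, high


# Hypothetical load values, not project data.
example_loads = {"sample_A": 0.4, "sample_B": 7.9, "sample_C": 2.0, "sample_D": 0.0}
low_group, high_group = split_by_load(example_loads, threshold=2.0)
print(len(low_group), len(high_group))
```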

[Figure: low viral load vs. high viral load]

Mastadenovirus load in 1000 Genomes data

Group                     Number of samples
Data with adenoviruses    123
Control                   401

GWAS analysis (EBV)

SNP           Gene                             P-value
rs6710977     LOC105373958 : Intron Variant    8.795e-7
rs11780074    None                             1.404e-6
rs10532235    LOC643339 : Intron Variant       1.472e-6
rs10808854    None                             2.363e-6
rs2033302     LOC102724691 : Intron Variant    2.396e-6

GWAS analysis (mastadenovirus)

SNP            Gene                                       P-value
rs28529415     None                                       6.983e-8
rs138432335    CDH4 : Intron Variant                      1.187e-7
rs112308716    None                                       3.254e-7
rs529717656    ROCK1P1 : Non Coding Transcript Variant    4.495e-7
rs12681923     LOC105377793 : Intron Variant              7.282e-7

Project overall results

1. A pipeline for working with WGS data was created

2. Viral load of EBV and mastadenoviruses was estimated

3. GWAS analysis was performed for 5 populations from 1000 Genomes

Thank you for your attention!

Project GitHub link: https://github.com/Alisa1195/Searching-for-latent-viruses-in-human-whole-genome-sequencing-data