Analyses of Viral Genomes for G-Quadruplex Forming Sequences Reveal Their Correlation with the Type of Infection
Biochimie 186 (2021) 13–27
https://doi.org/10.1016/j.biochi.2021.03.017

Natália Bohálová a,b,1, Alessio Cantara a,b,1, Martin Bartas c, Patrik Kaura d, Jiří Šťastný d,e, Petr Pečinka c, Miroslav Fojta a, Jean-Louis Mergny a,f, Václav Brázda a,*

a Institute of Biophysics of the Czech Academy of Sciences, Královopolská 135, Brno, 612 65, Czech Republic
b Department of Experimental Biology, Faculty of Science, Masaryk University, Kamenice 5, 62500, Brno, Czech Republic
c Department of Biology and Ecology/Institute of Environmental Technologies, Faculty of Science, University of Ostrava, Ostrava, 710 00, Czech Republic
d Brno University of Technology, Faculty of Mechanical Engineering, Technická 2896/2, 616 69, Brno, Czech Republic
e Department of Informatics, Mendel University in Brno, Zemědělská 1, Brno, 613 00, Czech Republic
f Laboratoire d’Optique et Biosciences, École Polytechnique, CNRS, INSERM, Institut Polytechnique de Paris, 91128, Palaiseau, France

* Corresponding author. E-mail address: [email protected] (V. Brázda).
1 These authors contributed equally.

Article history: Received 18 January 2021; received in revised form 30 March 2021; accepted 31 March 2021; available online 9 April 2021

Keywords: G-quadruplex; Viral genome; Bioinformatics; Persistent infection; Acute infection; G4Hunter

Abstract

G-quadruplexes contribute to the regulation of key molecular processes. Their utilization for antiviral therapy is an emerging field of contemporary research. Here we present comprehensive analyses of the presence and localization of putative G-quadruplex forming sequences (PQS) in all viral genomes currently available in the NCBI database (including subviral agents). The G4Hunter algorithm was applied to a pool of 11,000 accessible viral genomes representing 350 Mbp in total. PQS frequencies differ across evolutionary groups of viruses, and are enriched in repeats, replication origins, 5′UTRs and 3′UTRs. Importantly, PQS presence and localization is connected to viral lifecycles and corresponds to the type of viral infection rather than to nucleic acid type; while viruses routinely causing persistent infections in Metazoa hosts are enriched for PQS, viruses causing acute infections are significantly depleted for PQS. The unique localization of PQS identifies the importance of G-quadruplex-based regulation of viral replication and life cycle, providing a tool for potential therapeutic targeting.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

G-quadruplexes (G4s) are stacked secondary nucleic acid structures formed by G-rich DNA or RNA sequences. The building block of a G4 is a G-quartet that arises via Hoogsteen base pairing of four guanines. Two or more G-quartets are stacked on top of each other and physiologically stabilized mainly by monovalent cations, such as potassium [1]. G4s have been found in various genome locations such as telomeres, origins of replication and promoter regions [2,3]. With respect to their presence in these particular areas of genomes, G4s are most likely involved in different cellular events such as modulation of telomeric functions or regulation of gene expression [2,4].

In recent years, increasing evidence showed that G4s are also involved in the control of key viral processes, such as replication or transcription, and influence viral latency [5]. Stabilization of G4 by specific G4 ligands, such as bisquinolinium derivatives (PhenDC3), porphyrin derivatives, perylenes and naphthalene diimides (PIPER), pyridostatin (PDS) or acridine derivatives (e.g., BRACO-19) can modulate viral gene expression and, moreover, viral infection [6]. Therefore, G4 ligands can be considered as promising anti-proliferative agents [7] and also as potential antiviral therapeutic compounds [6].

Under physiological conditions, G4 formation is regulated by G4-binding proteins [8]. This suggests their crucial roles in a number of signaling pathways, including those relevant for viral infection (e.g., the immune response). For example, nucleolin, the most abundant nucleolar phosphoprotein known to promote the formation of G4 and inhibit the activity of the c-Myc gene [9], has also been reported to stabilize the formation of G4 in LTR repeats of human immunodeficiency virus 1 (HIV-1) and thus silence viral transcription [10]. Nucleolin also mediates immune evasion of Epstein-Barr virus (EBV) by stabilizing G4 in EBV-encoded nuclear antigen-1 (EBNA1) mRNA [11] and suppresses replication of Hepatitis C Virus (HCV) by binding to RNA G4 [12]. Furthermore, viral G4-binding proteins are crucial for the viral life cycle; for example, nucleocapsid proteins of HIV-1 can function as molecular chaperones for G4 [13], and the G4-binding macrodomain within the SARS unique domain (SUD) of the SARS-CoV Nsp3 protein is essential for viral replication and transcription [14]. Targeting of viral G4-binding proteins is thus a prospective antiproliferative and antiviral therapeutic strategy. For this purpose, antiviral DNA aptamers have been designed [15].

To assess the propensity of a sequence to form G4 and the distribution of these putative G4-forming sequences (PQS) in genomes, several algorithms have already been described. Chronologically, the first generation algorithm searches for the [GnNmGnNoGnNpGn] motif in a sequence [16]. The second generation algorithm takes into consideration the occurrence of repeating units of Gn (n ≥ 2) [17]. Nevertheless, both algorithms produce binary (yes/no; match/no match) results, rather than the quantitative analyses that are necessary for correlation of G4 strength metrics. G4Hunter was developed to overcome this limitation, and it calculates G4 propensity depending on the asymmetrical distribution of G and C (GC-skewness) [18].

Our aim was to analyze for the first time a pool of all 11,000 accessible viral genomes, searching for the presence of PQS with the G4Hunter web application [19]. Viruses are small infectious agents that replicate inside a living cell of an organism. According to the Baltimore classification they can be divided into classes according to their genome: double-stranded DNA (dsDNA) viruses, single-stranded DNA (ssDNA) viruses, double-stranded RNA (dsRNA) viruses, positive sense RNA (ssRNA+) viruses, negative stranded RNA (ssRNA-) viruses, RNA reverse transcribing viruses and DNA reverse transcribing viruses. In addition, there are also subviral agents such as viroids and satellites. Viroids are small circular ssRNA molecules, which do not encode any protein. All known viroids infect higher plants and represent dangerous agricultural pathogens [20]. Satellites are also mainly found in plants; their replication depends on co-infection with an independent helper virus [21]. Evidence of the presence of PQS has already been reported in several viruses from different families; for example dsDNA viruses from Herpesviridae or Papillomaviridae, ssRNA+ viruses from Flaviviridae or Filoviridae, RNA viruses with reverse transcriptase (RT) Retroviridae and also dsDNA with RT Hepadnaviridae [6,22]. The presence of PQS and G4 functions in viruses remains poorly understood and until now has been mainly limited to studies of eukaryotic viruses by first gen-

2.2. Process of analysis

We used the computational core of our DNA analyzer software written in Java with G4Hunter algorithm implementation [19]. Parameters for G4Hunter were set to 25 nucleotides for window size and a threshold score of 1.2 or above. A separate list of PQS in each of the sequences and an overall report were obtained. The overall results for each species group contained a list of species with size of genomic DNA, GC content, number of PQS found, frequency normalized per 1,000 nt and length of the sequence covered with PQS (Supplementary Material 2). PQS were also grouped depending on their G4Hunter score in the five following intervals: 1.2–1.4, 1.4–1.6, 1.6–1.8, 1.8–2.0 and 2.0 or more (the higher the score, the more likely/more stable is the G4).

2.3. Statistical analysis

Statistical evaluations of differences in PQS between phylogenetic groups, viruses grouped by the type of infection and features were made by the Kruskal–Wallis test. Post-hoc multiple pairwise comparison by Dunn’s test with Bonferroni correction of the significance level was applied with a p-value cut-off of 0.05. Data are available in Supplementary Material 3. A cluster dendrogram of PQS characteristics was constructed in the program R, version 3.6.3, using the pvclust package to further reveal and graphically depict similarities between particular viral families. The following values were used as input data: Mean f (mean of predicted PQS per 1,000 nt), Min f (the lowest frequency of predicted PQS per 1,000 nt), Max f (the highest frequency of predicted PQS per 1,000 nt) and Cov % (% of the genome covered by PQS with a threshold of 1.2 or more) (Supplementary Material 4). The following parameters were used for both analyses: cluster method ‘ward.D2’, distance ‘euclidean’, and 10,000 bootstrap resamplings. Statistically significant clusters (based on approximately unbiased values above 95, equivalent to p-values less than 0.05) are highlighted by rectangles marked with broken red lines. R code is provided in Supplementary Material 4. Receiver operating characteristic (ROC) curves were constructed with a web-based ROC tool [26] to determine the area under the curve (AUC) for genome groups and viral families causing either persistent or acute type of infection. Correlation was assessed by the non-parametric Spearman’s correlation coefficient.