Using Hidden Markov Models to Investigate G- Quadruplex Motifs in Genomic Sequences
Total Page:16
File Type:pdf, Size:1020Kb
Yano and Kato BMC Genomics 2014, 15(Suppl 9):S15 http://www.biomedcentral.com/1471-2164/15/S9/S15 RESEARCH Open Access Using hidden Markov models to investigate G- quadruplex motifs in genomic sequences Masato Yano1, Yuki Kato2* From Asia Pacific Bioinformatics Network (APBioNet) Thirteenth International Conference on Bioinformatics (InCoB2014) Sydney, Australia. 31 July - 2 August 2014 Abstract Background: G-quadruplexes are four-stranded structures formed in guanine-rich nucleotide sequences. Several functional roles of DNA G-quadruplexes have so far been investigated, where their putative functional roles during DNA replication and transcription have been suggested. A necessary condition for G-quadruplex formation is the presence of four regions of tandem guanines called G-runs and three nucleotide subsequences called loops that connect G-runs. A simple computational way to detect potential G-quadruplex regions in a given genomic sequence is pattern matching with regular expression. Although many putative G-quadruplex motifs can be found in most genomes by the regular expression-based approach, the majority of these sequences are unlikely to form G- quadruplexes because they are unstable as compared with canonical double helix structures. Results: Here we present elaborate computational models for representing DNA G-quadruplex motifs using hidden Markov models (HMMs). Use of HMMs enables us to evaluate G-quadruplex motifs quantitatively by a probabilistic measure. In addition, the parameters of HMMs can be trained by using experimentally verified data. Computational experiments in discriminating between positive and negative G-quadruplex sequences as well as reducing putative G- quadruplexes in the human genome were carried out, indicating that HMM-based models can discern bona fide G- quadruplex structures well and one of them has the possibility of reducing false positive G-quadruplexes predicted by existing regular expression-based methods. Furthermore, our results show that one of our models can be specialized to detect G-quadruplex sequences whose functional roles are expected to be involved in DNA transcription. Conclusions: The HMM-based method along with the conventional pattern matching approach can contribute to reducing costly and laborious wet-lab experiments to perform functional analysis on a given set of potential G- quadruplexes of interest. The C++ and Perl programs are available at http://tcs.cira.kyoto-u.ac.jp/~ykato/program/ g4hmm/. Background in the antiparallel way. In contrast, several in vitro experi- Deoxyribonucleic acids (DNAs) are macromolecules that ments reveal the existence of non-B-DNA structures hold genetic information in almost all of the organisms. caused by particular sequence motifs and DNA-protein The bulk of existing DNA molecules is assumed to form a interactions. Well investigated examples include G-quad- right-handed double helical structure called B-DNA [1], ruplex (G4), Z-DNA, cruciform and triplex. Recent where each constituent bases A and C selectively bind to advances in providing in vitro evidence of these specific bases T and G, respectively, between two strands arranged structures develop the hypothesis that these structures are considered to have some functional roles in living cells [2]. A G4 structure is one of the topological conformations * Correspondence: [email protected] 2Center for iPS Cell Research and Application (CiRA), Kyoto University, 53 that DNAs can adopt, where G-quartets, hydrogen-bonded Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan square planar substructures between four guanines (Gs), Full list of author information is available at the end of the article © 2014 Yano and Kato; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Yano and Kato BMC Genomics 2014, 15(Suppl 9):S15 Page 2 of 11 http://www.biomedcentral.com/1471-2164/15/S9/S15 Since little is known about the functions of G4 struc- tures and genome-scale wet-lab experiments with nuclear magnetic resonance (NMR) spectroscopy for structural analysis [9] are not feasible, several computational efforts have been made on identifying the locations of potential G4 sequences in genomic DNAs and inferring their func- tions by comparative sequence analysis using related genes with known functions [10,11]. In principle, G4 motifs can be represented by a regular expression G+N *G+N*G+N*G+, where ‘N’ shows an arbitrary base including G, ‘+’ denotes ‘ ’ Figure 1 An example of G4 structure.(a)Fourguaninesform at least one repeat of the preceding symbol and * means hydrogen bonds to their adjacent guanines, resulting in a G-quartet. at least zero repeats. Due to this simple pattern of G4 (b) Several G-quartets are stacked onto each other to form a G- motifs, several in silico methods have been proposed to quadruplex (G4) structure. A G4 structure has its sequence as detect G4 sequences in genomes using pattern matching alternate G-runs and loops. with regular expression [12-16]. Moreover, regular expres- sion-based methods that incorporate a simple scoring scheme are proposed [17-19]. Another computational study focuses on thermodynamic stability of G4 structures are stacked onto each other (see Figure 1). At the using Gaussian process regression [20]. Although the pat- sequence level, a G4 sequence can be represented by four tern matching approaches can detect many G4 motifs in regions of consecutive Gs that form G-quartets, called G- genomic sequences quite fast, it is pointed out that the runs,andthreeregionsofnucleotide subsequences that majorityofthesemotifsmaybefalsepositiveG4 connect G-runs, called loops, which can have varying sequences [21,22]. length including lack of loop [3]. Note that consecutive Gs In this contribution, we present more elaborate compu- may stretch over G-run and loop regions, making the pro- tational models than regular expression to represent G4 blem of predicting G-run regions somewhat complex. motifs, employing hidden Markov models (HMMs). Loops are known to play an important role in stability of HMMs are so flexible in modeling linear dependence that G4 structures [4]. In addition, it is pointed out that the they are widely used in bioinformatics including protein loop length affects therapeutic selectivity to target a G4 secondary structure prediction [23,24] and sequence motif instead of the topology of the G4 structure [5]. G4 struc- search [25]. To model G4 motifs, we provide four HMM- tures are also stabilized by monovalent cations, especially based models from the viewpoint of the number of hidden K+, located in the central cavities in the stack. states that describe G-runs and loops, and compare with Eukyariotic telomeric sequences include G-rich regions each other in three computational experiments. The first and they can form G4 structures in vitro. However, the preliminary experiment in predicting G-run regions in a question of how many such G-rich regions can actually set of 100 real G4 sequences in the literature [20] indicates form G4 structures in vivo has not been resolved. The that each HMM-based model can represent actual G-run potential to form G4 structures in telomeric sequences in regions well. The subsequent experiment in discriminating vivo can be shown by in vitro DNA binding experiments real and shuffled G4 sequences by using HMMs shows with those sequences [2]. For example, telomere end-bind- that the models considering detailed distributions of G- ing proteins in ciliates can control the formation of G4 run and loop lengths can outperform the simple probabil- DNA structures at telomeres [6]. Interestingly, however, a istic extension of regular expression. In the third test with recent study suggests that endogenous G4 structures in statistical analysis in discriminating highly likely G4 struc- human cells are present largely outside the telomeres [7]. tures from putative G4 motifs in the human pre-mRNA Another work reports that protruding nucleotides in sequences [26], the results show that the HMM-based human telomeric sequences destabilize the G4 structure model that can represent elaborate length distribution of and overhanging sequences influence the folding of the G-run regions outperforms the other three models pre- quadruplex [8]. Other examples of G-rich regions in gen- sented in this work. Moreover, the above model can be omes are transcriptional start sites, mitotic and meiotic specialized to detect G4 sequences whose functional roles double strand break sites. Although G4 structures have are expected to be involved in DNA transcription. Finally, stability with higher temperature than that of canonical this model in conjunction with pattern search is applied to double helix structures, many functional regions in geno- G4 screening in the whole human genome, producing a mic sequences have not a few G-rich motifs [2], motivat- considerably smaller number of G4 candidates with statis- ing us to investigate further the functional roles of G4 tical significance than that of G4 sequences predicted by structures. pattern matching alone. Yano and Kato BMC Genomics 2014, 15(Suppl 9):S15 Page 3 of 11