Simultaneous Characterization of Sense and Antisense Genomic

Nucleic Acids Research Advance Access published November 17, 2015 Nucleic Acids Research, 2015 1 doi: 10.1093/nar/gkv1184 Simultaneous characterization of sense and antisense genomic processes by the double-stranded hidden Markov model Julia Glas1,†, Sebastian Dumcke¨ 2,3,†, Benedikt Zacher1,2,†,DonPoron3, Julien Gagneur1 and Achim Tresch1,2,3,* 1Gene Center Munich and Department of Biochemistry, Ludwig-Maximilians-Universitat¨ Munchen,¨ Feodor-Lynen-Straße 25, 81377 Munich, Germany, 2Department of Plant Breeding and Genetics, Max Planck Institute for Plant Breeding Research, Carl-von-Linne-Weg´ 10, 50829 Cologne, Germany and 3Institute for Genetics, University of Cologne, Zulpicher¨ Str. 47b, 50674 Cologne, Germany Downloaded from Received June 19, 2015; Revised October 16, 2015; Accepted October 24, 2015 ABSTRACT group data points into a finite number of distinct ‘groups’ or ‘states’, according to some measure of similarity. Our Hidden Markov models (HMMs) have been exten- present work considers the analysis of data from several http://nar.oxfordjournals.org/ sively used to dissect the genome into functionally experiments whose output can be aligned to a genomic se- distinct regions using data such as RNA expression quence, such as RNA expression-, ChIP- or DNA methyla- or DNA binding measurements. It is a challenge to tion data. The main purpose is to cluster genomic positions disentangle processes occurring on complementary into ‘states’, which ideally correspond to distinct biological strands of the same genomic region. We present functions. Hidden Markov models (HMMs) have become the double-stranded HMM (dsHMM), a model for the the method of choice, since they additionally account for strand-specific analysis of genomic processes. We the dependency of consecutive observations introduced by applied dsHMM to yeast using strand specific tran- the sequential structure of the data. HMMs were success- at MPI Study of Societies on April 4, 2016 scription data, nucleosome data, and protein binding fully used for dissecting the genome into ‘chromatin states’ (1) or ‘transcription states’ (2). Recently, HMMs were em- data for a set of 11 factors associated with the regula- ployed to infer distinct genomic states from genome-wide tion of transcription.The resulting annotation recov- ChIP data in human (1,3–8), fly (9,10), Arabidopsis (11)and ers the mRNA transcription cycle (initiation, elonga- worm (12,13). However, the drawback of the HMMs used tion, termination) while correctly predicting strand- in these applications is their inability to integrate stand- specificity and directionality of the transcription pro- specific (e.g. RNA expression) with non-strand-specific (e.g. cess. We find that pre-initiation complex formation ChIP) data and thus limiting the analysis either to only non- is an essentially undirected process, giving rise to a strand-specific or only single-stranded data. This limitation large number of bidirectional promoters and to per- was first addressed in (14). There, the hidden Markov chain vasive antisense transcription. Notably, 12% of all is replaced by a dynamic Bayesian network (DBN). This al- transcriptionally active positions showed simultane- lows the modeling of strand-specific data and the introduc- ous activity on both strands. Furthermore, dsHMM tion of structured states, i.e. hierarchical labels for each position. Other approaches employed reversible HMMs (15), reveals that antisense transcription is specifically which were further extended by the bidirectional hidden suppressed by Nrd1, a yeast termination factor. Markov model (2). Still all of these models are unable to account for overlapping processes that might occur on both INTRODUCTION DNA strands. This situation is frequently encountered in compact genomes. For example, cryptic unstable transcripts The rapidly growing amount of heterogeneous data gen- (CUTs) and stable uncharacterized transcripts (SUTs) often erated by experimental high-throughput techniques makes overlap with annotated features (16). Neil et al. (17) showed integrative data analysis an essential part of molecular bi- that by far the most CUTs in yeast are antisense CUTs, i.e. ology. One major purpose is the creation of comprehen- CUTs that are transcribed from between tandem features in sive views on high dimensional data which are impossi- antisense direction. ble to obtain manually. To that end, clustering methods *To whom correspondence should be addressed. Tel: +49 89 21807634; Fax: +44 89 218076797; Email: [email protected] †These authors contributed equally to the work as the first authors. C The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 2 Nucleic Acids Research, 2015 In the present work we introduce double-stranded hidden Markov models (dsHMMs), which explicitly model the forward and reverse DNA strand by two Markov chains running in opposite directions. Therefore dsHMMs are able to disentangle the two strands at every single position of the genome. dsHMMs capture the biology of directed genomic processes such as transcription, which often occur at the same position on both strands. We illustrate the use of dsH- MMs on a data set comprised of strand-specific expression, nucleosome occupancy and ChIP-chip data of 11 factors involved in yeast transcription. We present the first strand- specific map of transcription states in yeast. Our results con- firm the role of Nrd1 in a non-canonical pathway of transcription termination, which predominately occurs in antisense direction of stable gene transcripts and is mainly involved in the termination of CUTs (16,18). Downloaded from MATERIALS AND METHODS Definition of the dsHMM Let P be a set of genome-wide experiments (‘tracks’) which http://nar.oxfordjournals.org/ give rise to a sequence of observables O = (o1, ..., oT), ot ∈ RP j ,whereot contains the measurement value of track j at position t. The observables are emitted by two independent, homogeneous Markov chains of hidden variables running S+ = +, ..., + in opposite direction, the forward chain (s1 sT ) S− = −, ..., − and the reverse chain (s1 sT ). The idea is that among a finite set of biological processes D, each state + ∈ D − ∈ D st (respectively st ) indicates which process is tak- ing place on the forward (respectively reverse) DNA strand. Figure 1. (A) Graphical representation of a dsHMM showing two hidden at MPI Study of Societies on April 4, 2016 { +,..., +} { −,..., −} state chains s1 sT and s1 sT which run in opposite direction The emission distribution of an observation ot is condition- +, − = ally independent of all other observations, given the hidden (white circles). Each state pair (st st ) emits an observation ot, t 1, ..., +, − T (gray circles). (B) Viterbi paths inferred from a synthetic data set us- state pair (st st ). A graphical specification of the double- ing three different HMM models. The top panel shows a simulated ChIP stranded hidden Markov model (dsHMM) is given in Fig- experiment (purple track) and an RNA-Seq experiment with forward (or- ure 1A. According to our assumptions, the joint likelihood ange track) and reverse strand (brown track) expression data for a genomic of a dsHMM factors into region containing partly overlapping genes (arrows in the middle panel) located on both DNA strands. The bottom panel shows the Viterbi paths P(O, S+, S−) = P(O | S+, S−) · P(S+) · P(S−)(1) obtained from the standard HMM, the bdHMM and the dsHMM with three hidden states: an intergenic state (gray) and two gene-specific states (red and green). For the bdHMM, equivalent forward and reverse states are T indicated by the same color, positioned either above (forward direction) or O | S+, S− = | +, − below (reverse direction) the undirected gray states. P( ) P(ot st st )(2) t=1 π ∈ RD T for some initial state distribution . For convenience, + + + + we require that A is ergodic, and that ␲ is its unique steady P(S ) = P(s ) P(s | s − ), 1 t t 1 state distribution. A dsHMM can then be transformed into t=2 (3) a standard HMM with state space D2 = D × D,withstate T S = , ..., = +, − ∈ D2 − − − − sequence (s1 sT), st (st st ) . The transition P(S ) = P(s ) P(s − | s ) D2×D2 T t 1 t matrix B = (brs) ∈ R becomes t=2 − + − + − = + + − − π − π 1 , = , , = , ∈ D2 It is natural to assume that state transitions happening brs ar s as r s r− r (r r ) s (s s ) (6) in forward direction on the forward strand have the same D2 probability as state transitions on the reverse strand hap- and the initial state distribution, τ ∈ R is τs = πs+ πs− pening in reverse direction, i.e. (see Supplementary Data Part I Section 1 for details). The + + − − dsHMM is a reversible HMM (see Supplementary Data P(s = j | s = i) = P(s = j | s = i) = a , i, j ∈ D t t−1 t−1 t ij (4) Part I Section 1 Remark 3). The transformation of the D×D dsHMM into a standard HMM will allow us to apply well- for some transition matrix A = (aij) ∈ R . By the same reasoning, we assume known, efficient techniques for HMM learning, namely the Forward-Backward, Viterbi and Baum–Welch algorithms + = = − = = π , ∈ D P(s1 i) P(sT i) i i (5) (see Supplementary Data Part I Sections 3–5). Nucleic Acids Research, 2015 3 It remains to find a sparse parametrization of the emis- expression data. The selected proteins cover most of the fac- 2 sion distributions ␺ s(o) = P(o|s), s ∈ D . Let the obser- tors involved in the mRNA transcription cycle. They in- vation space RP be the Cartesian product of the strand- clude initiation factors, different types of elongation factors unspecific observations RB, and the forward- respectively as well as termination factors.

Load more