Nucleic Acids Research Advance Access published November 17, 2015 Nucleic Acids Research, 2015 1 doi: 10.1093/nar/gkv1184 Simultaneous characterization of sense and antisense genomic processes by the double-stranded hidden Markov model Julia Glas1,†, Sebastian Dumcke¨ 2,3,†, Benedikt Zacher1,2,†,DonPoron3, Julien Gagneur1 and Achim Tresch1,2,3,*
1Gene Center Munich and Department of Biochemistry, Ludwig-Maximilians-Universitat¨ Munchen,¨ Feodor-Lynen-Straße 25, 81377 Munich, Germany, 2Department of Plant Breeding and Genetics, Max Planck Institute for Plant Breeding Research, Carl-von-Linne-Weg´ 10, 50829 Cologne, Germany and 3Institute for Genetics, University of Cologne, Zulpicher¨ Str. 47b, 50674 Cologne, Germany Downloaded from Received June 19, 2015; Revised October 16, 2015; Accepted October 24, 2015
ABSTRACT group data points into a finite number of distinct ‘groups’ or ‘states’, according to some measure of similarity. Our Hidden Markov models (HMMs) have been exten- present work considers the analysis of data from several http://nar.oxfordjournals.org/ sively used to dissect the genome into functionally experiments whose output can be aligned to a genomic se- distinct regions using data such as RNA expression quence, such as RNA expression-, ChIP- or DNA methyla- or DNA binding measurements. It is a challenge to tion data. The main purpose is to cluster genomic positions disentangle processes occurring on complementary into ‘states’, which ideally correspond to distinct biological strands of the same genomic region. We present functions. Hidden Markov models (HMMs) have become the double-stranded HMM (dsHMM), a model for the the method of choice, since they additionally account for strand-specific analysis of genomic processes. We the dependency of consecutive observations introduced by applied dsHMM to yeast using strand specific tran- the sequential structure of the data. HMMs were success- at MPI Study of Societies on April 4, 2016 scription data, nucleosome data, and protein binding fully used for dissecting the genome into ‘chromatin states’ (1) or ‘transcription states’ (2). Recently, HMMs were em- data for a set of 11 factors associated with the regula- ployed to infer distinct genomic states from genome-wide tion of transcription.The resulting annotation recov- ChIP data in human (1,3–8), fly (9,10), Arabidopsis (11)and ers the mRNA transcription cycle (initiation, elonga- worm (12,13). However, the drawback of the HMMs used tion, termination) while correctly predicting strand- in these applications is their inability to integrate stand- specificity and directionality of the transcription pro- specific (e.g. RNA expression) with non-strand-specific (e.g. cess. We find that pre-initiation complex formation ChIP) data and thus limiting the analysis either to only non- is an essentially undirected process, giving rise to a strand-specific or only single-stranded data. This limitation large number of bidirectional promoters and to per- was first addressed in (14). There, the hidden Markov chain vasive antisense transcription. Notably, 12% of all is replaced by a dynamic Bayesian network (DBN). This al- transcriptionally active positions showed simultane- lows the modeling of strand-specific data and the introduc- ous activity on both strands. Furthermore, dsHMM tion of structured states, i.e. hierarchical labels for each po- sition. Other approaches employed reversible HMMs (15), reveals that antisense transcription is specifically which were further extended by the bidirectional hidden suppressed by Nrd1, a yeast termination factor. Markov model (2). Still all of these models are unable to ac- count for overlapping processes that might occur on both INTRODUCTION DNA strands. This situation is frequently encountered in compact genomes. For example, cryptic unstable transcripts The rapidly growing amount of heterogeneous data gen- (CUTs) and stable uncharacterized transcripts (SUTs) often erated by experimental high-throughput techniques makes overlap with annotated features (16). Neil et al. (17) showed integrative data analysis an essential part of molecular bi- that by far the most CUTs in yeast are antisense CUTs, i.e. ology. One major purpose is the creation of comprehen- CUTs that are transcribed from between tandem features in sive views on high dimensional data which are impossi- antisense direction. ble to obtain manually. To that end, clustering methods
*To whom correspondence should be addressed. Tel: +49 89 21807634; Fax: +44 89 218076797; Email: [email protected] †These authors contributed equally to the work as the first authors.