bioRxiv preprint doi: https://doi.org/10.1101/041020; this version posted February 24, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Accurate promoter and enhancer identification in 127 ENCODE and Roadmap Epigenomics cell types and tissues by GenoSTAN

Benedikt Zacher1,∗, Margaux Michel4, Björn Schwalb4, Patrick Cramer4, Achim Tresch2,3,∗∗, Julien Gagneur1,5,∗∗∗

February 24, 2016

1. Feodor-Lynen-Str. 25, Munich, Germany, Gene Center and Department of Biochemistry, Center for Integrated Protein Science CIPSM, Ludwig-Maximilians-Universität München. 2. Zülpicher Str. 47, Cologne, Germany, Department of Biology, University of Cologne. 3. Carl-von-Linne-Weg 10, 24105 Cologne, Germany, Max Planck Institute for Plant Breeding Research. 4. Am Fassberg 11, Göttingen, Germany, Department of Molecular Biology, Max Planck Institute for Biophysical Chemistry 5. Present address: Technische Universität München, Department of Informatics, Boltzmannstr. 3, 85748 Garching, Germany

*Corresponding Author: Benedikt Zacher, Tel.: +49-89-2180-76740; Fax: +49-89-2180-76797; E-mail: [email protected]

**Corresponding Author: Achim Tresch, Tel.: +49-221-470-8292; Fax: +49-221-5062-163; E-mail: [email protected]

*** Corresponding Author: Julien Gagneur, Tel.: +49-89-2891-19411; Fax: +49-89-2891-19414; E-mail: [email protected]

Abstract

Accurate maps of promoters and enhancers are required for understanding transcriptional regulation. Promoters and enhancers are usually mapped by integration of chromatin assays charting histone modifications, DNA accessibility, and transcription factor binding. However, current algorithms are limited by unrealistic data distribution assumptions. Here we propose GenoSTAN (Genomic STate ANnotation), a hidden Markov model overcoming these limitations. We map promoters and enhancers for 127 cell types and tissues from the ENCODE and Roadmap Epigenomics projects, today’s largest compendium of chromatin assays. Extensive benchmarks demonstrate that GenoSTAN consistently identifies promoters and enhancers with significantly higher accuracy than previous methods. Moreover, GenoSTAN-derived promoters and enhancers showed significantly higher enrichment of complex trait-associated genetic variants than current annotations. Altogether, GenoSTAN provides an easy-to-use tool to define promoters and enhancers in any system, and our annotation of human transcriptional cis-regulatory elements constitutes a rich resource for future research in biology and medicine.

Introduction

Transcription is tightly regulated by cis-regulatory DNA elements known as promoters and enhancers. These elements control development and cell fate, and may lead to disease if impaired. A promoter is functionally defined as a region that regulates transcription of a gene, located upstream and in close proximity to the transcription start sites (TSSs) [3]. In contrast, an enhancer was originally functionally defined as a DNA element that can increase expression of a gene over a long distance in an orientation-independent fashion relative to the gene [8]. The functional definition of enhancers and promoters leads to practical difficulties for their genome-wide identification because the direct measurement of the regulatory activity of genomic regions is hard, with current approaches leading to contradicting results [35, 30, 7]. Since the direct measurement of cis-regulatory activity is challenging, a biochemical characterization of the chromatin at these elements based on histone modifications, DNA accessibility, and transcription factor binding has been proposed [53, 6, 17, 26, 33]. This approach leverages extensive genome-wide datasets of chromatin immunoprecipitation followed by sequencing (ChIP-Seq) of transcription factors (TFs), histone modifications, or Cap analysis gene expression (CAGE) that have been generated by collaborative projects such as ENCODE [13, 61], NIH Roadmap Epigenomics [12], BLUEPRINT [1] and FANTOM [5, 14]. In this context, the computational approaches employed to classify genomic regions as enhancers or promoters play a decisive role [53, 33]. As the experimental data are heterogeneous, we generally refer to them as tracks. Several studies used supervised learning techniques to predict enhancers based on tracks such as histone modifications or P300 binding (e.g. [32, 39, 51, 60]). 
However, a training set of validated enhancers is needed in this case, which is hard to define since only few enhancers have been validated experimentally so far and these might be biased towards specific enhancer subclasses. Alternatively, unsupervised learning algorithms were developed to identify promoters and enhancers from combinations of histone marks and protein-DNA interactions alone [17, 20, 27, 26, 64, 29, 12, 13]. These unsupervised methods perform genome segmentation, i.e. they model the genome as a succession of segments in different chromatin states defined by characteristic combinations of histone marks and protein-DNA interactions found recurrently throughout the genome. All popular genome segmentations are based on hidden Markov models [49]. However, these methods differ in the way the distribution of ChIP-seq signals for each chromatin state is modeled. ChromHMM [17, 18, 20], one of the two methods applied by the ENCODE consortium, requires binarized ChIP-seq signals that are then modeled with independent Bernoulli distributions. Consequently, the performance of ChromHMM highly depends on the non-trivial choice of a proper binarization cutoff. Moreover, quantitative information is lost with this approach, which is especially important for distinguishing promoters from enhancers since these elements are both marked with H3K4me1 and H3K4me3, but at different ratios [2]. Segway [26, 27], the other method applied by the ENCODE consortium, uses independent Gaussian distributions of log-transformed and smoothed ChIP-seq signal. Although Segway preserves some quantitative information, the transformation of the original count data leads to variance estimation difficulties for very low counts. Therefore, Segway further makes the strong assumption that all tracks have the same variance. Recently, EpicSeg [43] used a negative multinomial distribution to directly model the read counts without the need for data transformations. 
However, similar to the variance model of Segway, the EpicSeg model leads to a common dispersion (the parameter adjusting the variance of the negative multinomial) for all tracks. Moreover, EpicSeg does not provide a way to correct for sequencing depth, which makes the application to data sets with multiple cell types with varying library sizes difficult. Also, EpicSeg has been applied only to three cell types so far [43]. These methods not only differ in their modeling assumptions but also lead to very different results. In the K562 cell line for instance, ChromHMM identified 22,323 enhancers [13], Segway 38,922 enhancers [13], and EpicSeg 53,982 enhancers [43]. Altogether, improved methods and detailed benchmarks are required for a reliable annotation of transcriptional cis-regulatory elements.

Here we propose a new unsupervised genome segmentation algorithm, GenoSTAN (Genomic State Annotation from sequencing experiments), which overcomes limitations of current state-of-the-art models. GenoSTAN learns chromatin states directly from sequencing data without the need for data transformation, while still having track-specific variance models. We applied GenoSTAN to a total of 127 cell types and tissues covering 16 datasets of ENCODE and all 111 datasets of the Roadmap Epigenomics project as well as another ENCODE ChIP-seq dataset for the K562 cell line. GenoSTAN consistently performed better when benchmarked against Segway, ChromHMM and EpicSeg segmentations using independent evidence for activity of promoter and enhancer regions. Co-binding analysis of TFs reveals that promoters and enhancers both shared the Polymerase II core transcription machinery and general TFs, but they are bound by distinct TF regulatory modules and differ in many biophysical properties. Moreover, GenoSTAN enhancer and promoter annotations had a higher enrichment for complex trait-associated genetic variants than previous annotations, demonstrating the advantage of GenoSTAN and our chromatin state map to understand genotype-phenotype relationships and genetic disease.

Results

Modeling of sequencing data with Poisson-lognormal and negative binomial distributions

We developed a new genomic segmentation algorithm, GenoSTAN, which implements hidden Markov models with more flexible multivariate count distributions than previously proposed. Specifically, GenoSTAN supports two multivariate discrete emission functions, the Poisson-lognormal distribution and the negative binomial distribution. To reduce running time, the components of these multivariate distributions are assumed to be independent. However, the variance is modeled separately for each state and each track, which provides a more realistic variance model than current approaches. To be applicable to data sets with replicate experiments or multiple cell types, GenoSTAN corrects for different library sizes of experiments (Methods). All parameters are learnt directly from the data, leaving the number of chromatin states as the only parameter to be manually set. We provide an efficient implementation of the Baum-Welch algorithm for inference of model parameters, which can be run in a parallelized fashion using multiple cores. The method is implemented as part of our previously published R/Bioconductor package STAN [62], which is freely available from http://bioconductor.org/. Altogether, GenoSTAN uniquely combines flexible count distributions, short running times, and a minimal number of manually entered parameters (Figure 1A). We performed an extensive benchmarking of GenoSTAN against alternative methods (Figure 1B). Benchmark I compares GenoSTAN and the three alternative methods for the K562 cell line. K562 is a major model system to study human transcription and the ENCODE cell line with the largest number of experiments [13]. Moreover, all three other methods (ChromHMM, Segway, and EpicSeg) had been run by their own authors for K562, so that the algorithm parameters for these annotations can be assumed to reflect the best expert knowledge. Benchmark II compares methods for all 127 cell types and tissues provided by ENCODE and Roadmap Epigenomics. 
This is the largest benchmark dataset of all three we considered but also the one with the smallest number of common tracks (5 chromatin marks: H3K4me1, H3K4me3, H3K36me3, H3K27me3, and H3K9me3, Figure 1B). Benchmark III compares results for a subset of 20 ENCODE and Roadmap Epigenomics cell types and tissues for which H3K27ac and H3K9ac ChIP-seq and DNase I hypersensitivity (DNase-Seq) data were additionally available. This distinction allowed us to provide more accurate annotations for the better characterized cell types and tissues. For benchmark II and benchmark III, only ChromHMM annotations were available to compare against the GenoSTAN annotations. Figure 1B lists all model names and studies that were considered.
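
The emission model described at the start of this section can be illustrated with a minimal sketch. It assumes a negative binomial parameterized by a mean and a size (inverse dispersion) per state and track, and it assumes that library size correction enters as a multiplicative size factor on the mean. Both choices, as well as the function and parameter names, are illustrative assumptions and are not taken from the STAN package or from Methods.

```python
import math


def nb_logpmf(x, mu, size):
    """Log-probability of count x under a negative binomial distribution.

    mu is the mean (must be > 0) and size is the inverse dispersion
    (must be > 0), so that the variance is mu + mu**2 / size.
    """
    p = size / (size + mu)
    return (
        math.lgamma(x + size)
        - math.lgamma(size)
        - math.lgamma(x + 1)
        + size * math.log(p)
        + x * math.log1p(-p)
    )


def emission_loglik(counts, means, sizes, size_factors):
    """Log-likelihood of one genomic bin under one chromatin state.

    counts:       observed read counts, one per track
    means:        state- and track-specific means
    sizes:        state- and track-specific size parameters
    size_factors: per-track library size factors (assumed multiplicative)

    Tracks are treated as independent, so the log-likelihoods add up.
    """
    total = 0.0
    for x, mu, size, s in zip(counts, means, sizes, size_factors):
        total += nb_logpmf(x, mu * s, size)
    return total
```

Because each state carries its own mean and size for every track, the variance can differ between tracks within a state, which is the property the text contrasts with a shared variance or dispersion.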

Benchmark I: Improved chromatin state annotation in human K562 cells

We first fitted two GenoSTAN models, one with Poisson-lognormal emissions (henceforth referred to as GenoSTAN-Poilog-K562 model) and one with negative binomial emissions (GenoSTAN-nb-K562 model) to a dataset of ChIP-seq data of 9 histone modifications, of the histone acetyltransferase P300, and DNA accessibility (by DNase-Seq) data for the K562 cell line at 200 bp binning resolution (Methods). As pointed out by others [17, 26], there is no purely statistical criterion of practical use for choosing the number of states from the data in such a setting. In practice, the number of states is manually defined by trading off goodness of fit against interpretability of the model [17, 26, 62]. For GenoSTAN-Poilog-K562, we used 18 chromatin states. For GenoSTAN-nb-K562, we used 23 states, since lower state numbers did not provide enough resolution to give a fine-grained map of chromatin states on this data set (Methods). Figure 2A compares the two GenoSTAN segmentations to segmentations from other studies using ChromHMM, Segway and EpicSeg on a region containing the TAL1 gene together with three known enhancers [43, 27, 13, 25].

Chromatin states recover biologically meaningful features

In order to assign biologically meaningful labels to each state of the Poilog-K562 and of the nb-K562 GenoSTAN models, we investigated their read coverage distributions and overlapped the occurrence of a state in the genome with known genomic features. In line with previous studies, this led to the definition of promoter, enhancer, repressed, actively transcribed and low coverage states [43, 20, 27]. The median read coverage in state segments and genomic distributions were very similar for both the Poilog-K562 and the nb-K562 models (Figure 2B, Supplementary Figure 1). Promoter states were characterized by a low (< 1) H3K4me1/H3K4me3 ratio, in contrast to enhancer states which showed a high ratio (> 1). Further, P300 levels were roughly two-fold higher in the enhancer state, which is in accordance with previous observations [2, 45, 55]. Promoter (Prom) states were located close to annotated GENCODE TSSs [24], with a median distance of 220 bp for the GenoSTAN-Poilog-K562 model and 400 bp for the GenoSTAN-nb-K562 model. Enhancer states (Enh) on the other hand were located further away from TSSs, with a median distance of 3.6 kb for Poilog-K562 and 5.8 kb for nb-K562 (Figure 2B, Supplementary Figure 1). Promoter and enhancer states also differed in their DNA sequence features. 45% of CpG islands were located within promoter states (strong, weak and promoter flanking states) in both models, but only 3% in enhancer states (strong, weak, enhancer flanking states, Figure 2B, Supplementary Figure 1). While promoter states mostly recovered stable TSSs, enhancer states were located at unstable TSSs (GRO-cap TSSs that are not recovered by the GENCODE annotation), which supports previous findings [16]. Furthermore, both models (GenoSTAN-nb-K562 and GenoSTAN-Poilog-K562) contained three states that we classified as “actively transcribed states”, which were characterized by high values of H3K36me3 and overlap with UTRs, introns and exons. 
Two out of three “transcribed” states were also enriched in promoter associated marks (H3K4me1-3, H3K27ac, H3K9ac) and H4K20me1 and thus represented 5’ transitions in transcription. Moreover, both models fitted four repressed states showing high read coverage of H3K27me3. Two of these states also exhibited high DNase-Seq and promoter/enhancer associated histone modification signals, suggesting that these states might reflect repressed regulatory regions (ReprEnh, ReprD). These elements were distal to annotated GENCODE TSSs (median distance: 5.2-11.8 kb). ReprEnh states were also enriched in P300 and recovered 0.2% of CpG islands, while ReprD states had lower P300 levels and recovered 8-9% of CpG islands in the genome (Figure 2B, Supplementary Figure 1). The remaining states exhibited low coverage in chromatin marks and therefore were labeled as “low” states. Altogether, GenoSTAN accurately recovered many features of known chromatin states and provided a high resolution map of these in K562.

High variation of enhancer predictions between chromatin state annotations of different studies

To assess the consistency of promoter and enhancer predictions across studies, we compared the GenoSTAN segmentations to other published segmentations in K562 by ChromHMM (’ChromHMM-ENCODE’ [13, 27] and ’ChromHMM-Nature’ [20]), Segway (’Segway-ENCODE’ [13, 27], ’Segway-nmeth’ [26] and ’Segway-Reg.Build’ [64]) and EpicSeg [43]. We computed pairwise Jaccard indices (the ratio of the number of common elements over all elements predicted by two methods) of promoter and enhancer states to quantify the agreement between the predictions of the different studies (Supplementary Figure 2). Promoter state annotations generally agreed well (median Jaccard index: 0.78). However, enhancer predictions varied more (median Jaccard index: 0.48), suggesting that enhancers are more difficult to annotate. This variation of enhancer calls was also reflected in the different numbers of annotated enhancer segments, which had been shown to vary greatly between different prediction methods [32]. The number of enhancer segments ranged from 10,932 segments in GenoSTAN-Poilog-K562 to 80,043 segments in one Segway annotation [26] (Supplementary Table 1). Therefore, a thorough assessment of these predictions was necessary to provide a robust and accurate prediction of these elements.
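
As an illustration of the Jaccard index used above, the following minimal sketch compares two sets of predicted segments. It assumes that the "elements" being compared are fixed-size genomic bins (200 bp here, matching the binning resolution mentioned earlier); the text does not specify the unit, so this choice and the function names are assumptions.

```python
def segments_to_bins(segments, bin_size=200):
    """Convert half-open (chrom, start, end) segments into a set of bins.

    Each bin is identified by (chrom, bin_index). A bin is included if the
    segment covers at least one of its bases.
    """
    bins = set()
    for chrom, start, end in segments:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins.add((chrom, b))
    return bins


def jaccard_index(segments_a, segments_b, bin_size=200):
    """Number of shared bins divided by the number of bins in either set."""
    a = segments_to_bins(segments_a, bin_size)
    b = segments_to_bins(segments_b, bin_size)
    union = a | b
    if not union:
        # Two empty annotations: the index is undefined, return 0.0 here.
        return 0.0
    return len(a & b) / len(union)


# Toy example with made-up coordinates.
method_a = [("chr1", 0, 400)]    # bins 0 and 1
method_b = [("chr1", 200, 800)]  # bins 1, 2 and 3
print(jaccard_index(method_a, method_b))  # 0.25
```

A value of 1 means both methods annotate exactly the same bins, and a value of 0 means they share none.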

Comparison of GenoSTAN with published chromatin state annotations

In order to benchmark the different segmentations, we used independent data including evidence of transcriptional activity (GRO-cap TSSs [16]), of transcription factor binding (ENCODE high occupancy target, or HOT regions [61], and ENCODE TF binding sites [13]), and of cis-regulatory activity (enhancer activity assessed by reporter assays [30]), which are all expected to be characteristics of promoters and enhancers. Transcription initiation activity is not only the hallmark of promoters, but also of enhancers [31, 5, 14, 16]. To benchmark the predictions using evidence for transcription, we used published data from a protocol called GRO-cap [16], a nuclear run-on protocol, which very sensitively maps transcription start sites genome-wide. To this end, for each method, we sorted chromatin states by decreasing precision of their overlap with GRO-cap TSSs. Starting with the most precise state (i.e. highest overlap with TSSs), we calculated cumulative recall and false discovery rate (FDR) by successively adding states with decreasing precision (Figure 3A). GenoSTAN-Poilog-K562 had the highest recall and the lowest FDR (Methods, Figure 3A). GenoSTAN-nb-K562 performed similarly to other segmentations (Segway-Reg.Build, ChromHMM-ENCODE). In particular, 94% of GenoSTAN-Poilog-K562 promoters (Prom.11) and 81% of its enhancer regions (Enh.15) overlapped with GRO-cap TSSs. This compares to 85% (Prom.16) and 65% (Enh.6) of GenoSTAN-nb-K562 and 89% (Tss) and 52% (Enh) of ChromHMM-ENCODE promoter and enhancer regions. Interestingly, the two ChromHMM segmentations (ChromHMM-ENCODE [27, 13], ChromHMM-Nature [20]) had very different accuracies for TSSs, which might be due to different data sets or cutoffs used for ChIP-seq binarization in the two studies. In contrast, the overall accuracy of the Segway annotations was comparable across studies. 
This comparison shows that GenoSTAN chromatin state annotation identifies putative promoters and enhancers which show transcriptional activity more frequently than previous annotations.
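
The cumulative recall and FDR procedure described above can be sketched as follows. The sketch assumes that, for each state, one already knows the number of segments, the number of segments overlapping a reference feature (for example a GRO-cap TSS), and the identifiers of the reference features recovered by that state. The exact definitions used in Methods may differ; the data layout and names here are hypothetical.

```python
def cumulative_recall_fdr(states, n_reference):
    """Rank states by precision and accumulate recall and FDR.

    states: dict mapping a state name to a tuple
            (n_segments, n_overlapping, reference_ids_hit), where
            n_segments > 0 and reference_ids_hit is a set.
    n_reference: total number of reference features (> 0).

    Returns a list of (state, cumulative_recall, cumulative_fdr) in the
    order in which states are added, most precise state first.
    """
    ranked = sorted(
        states.items(),
        key=lambda item: item[1][1] / item[1][0],  # precision of the state
        reverse=True,
    )
    total_segments = 0
    total_overlapping = 0
    recovered = set()
    curve = []
    for name, (n_segments, n_overlapping, reference_ids) in ranked:
        total_segments += n_segments
        total_overlapping += n_overlapping
        recovered |= reference_ids
        recall = len(recovered) / n_reference
        fdr = 1.0 - total_overlapping / total_segments
        curve.append((name, recall, fdr))
    return curve


# Toy numbers, chosen only to illustrate the calculation.
toy = {
    "A": (10, 9, {1, 2, 3}),
    "B": (20, 10, {3, 4}),
    "C": (50, 5, {5}),
}
for name, recall, fdr in cumulative_recall_fdr(toy, n_reference=10):
    print(name, round(recall, 3), round(fdr, 3))
```

Each step adds the next most precise state, so recall can only grow while FDR typically rises, which gives the kind of trade-off curve plotted in Figure 3A.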

GRO-cap is a very sensitive method that also captures a large number of TSSs of unstable transcripts. However, it is limited to capped RNA species, misses RNAs below the detection threshold and cannot be used to validate repressed (i.e. transcriptionally inactive) regulatory elements. To address these shortcomings we used two additional independent features, TF binding and HOT regions. The binding of TFs to a region of DNA is a pre-requisite for potential regulatory function and transcriptional activity. High occupancy target (HOT) regions are genomic regions which are bound by a large number of different transcription-related factors [61], which were shown to function as enhancers [34] and are enriched in disease- and trait-associated genetic variants [40]. As for the benchmark with TSSs, we sorted chromatin states by decreasing precision of their overlap with HOT regions and calculated cumulative recall and FDR (Figure 3B). The best performing segmentations for HOT regions were GenoSTAN-Poilog-K562 and GenoSTAN-nb-K562, followed by ChromHMM-ENCODE. The ordering of states with HOT regions was indeed different from the GRO-cap TSSs benchmark. In addition to the GenoSTAN promoter and enhancer states, the repressed enhancer state frequently overlapped with HOT regions, with an overall precision of 81% (GenoSTAN-Poilog-K562) and 77% (GenoSTAN-nb-K562). In comparison, the top three ChromHMM-ENCODE states had together a precision of 67%. All other segmentation methods showed a lower precision and recall for HOT regions. This was also reflected in the frequency of individual TF binding sites at enhancer regions, which was generally higher in GenoSTAN enhancer states than in other segmentations (Figure 3C). In particular, only a very small fraction of EpicSeg and Segway-nmeth enhancers were found to be bound by TFs. EpicSeg and Segway-nmeth segmentations were also those with the highest number of predicted enhancers, suggesting that many of these predictions are spurious.

Next, we calculated the recall of FANTOM5 promoters [14] and enhancers [5] to assess how well the models distinguish promoters from enhancers, as it was evident from inspection of specific examples that this distinction was difficult to establish with current methods (Figure 2A). The FANTOM5 consortium has performed extensive mapping of capped transcript 5’ ends using CAGE and defined enhancers and promoters based on transcriptional activity patterns. FANTOM5 enhancers were defined as regions showing balanced bidirectional capped transcripts, a hallmark of enhancer RNAs [5], whereas FANTOM5 promoters were defined as regions where transcription was biased towards one direction. The FANTOM5 annotation of enhancers and promoters could not entirely replace a chromatin state based approach because (i) the use of expression data in FANTOM5 limits the identified regulatory regions to transcriptionally active elements and (ii) CAGE was shown to be not as sensitive to rapidly degraded transcripts as GRO-cap and therefore might miss regulatory enhancers with unstable transcripts [16]. Nonetheless, FANTOM5 provides an annotation of enhancers and promoters based on independent data that is well suited to assess how well the models distinguish promoters from enhancers. We filtered the FANTOM5 annotation to promoters and enhancers for activity in K562 by overlapping them with DHS [13] and GRO-cap TSSs [16]. We considered that a promoter state performed well when the recall of FANTOM5 promoters was high and the recall of FANTOM5 enhancers was low, and vice versa for enhancer states. GenoSTAN-Poilog-K562 and ChromHMM-Nature enhancer states recalled the most FANTOM5 enhancers (60%, Figure 3D, Supplementary Table 2). For enhancer states, the recall of FANTOM5 promoters was around 10%, except for those of Segway-ENCODE, which recalled almost 35% of FANTOM5 promoters, and EpicSeg, which recalled 21% of FANTOM5 promoters. 
In accordance with this, many promoter regions were erroneously classified as enhancer regions in this segmentation (e.g. TAL1 promoter in Figure 2A). The recall of FANTOM5 enhancers by promoter states was generally higher (17% - 37%). GenoSTAN-Poilog-K562 and -nb-K562 recalled more than 90% of FANTOM5 promoters and around 20% of FANTOM5 enhancers, which is comparable to other studies (Segway-nmeth, ChromHMM-Nature, EpicSeg). ChromHMM-ENCODE promoter states had a comparable recall of FANTOM5 promoters (92%), but a higher recall of FANTOM5 enhancers (37%) (Figure 3D). This strong overlap of ChromHMM-ENCODE promoters with FANTOM5-labeled enhancers is in accordance with our observation that some enhancer regions were erroneously classified as promoters in ChromHMM-ENCODE (Figure 2A). These results show that GenoSTAN segmentations distinguish promoters from enhancers at similar or better accuracy than other segmentations.

So far we only used indirect evidence (TSSs, HOT regions, TF binding, FANTOM5 enhancers) to draw conclusions about the cis-regulatory activity of a candidate enhancer. As additional and direct evidence for the cis-regulatory activity of enhancer regions inferred by GenoSTAN, we overlapped our enhancers with genomic sequences that were previously tested for cis-regulatory activity in a reporter assay, where candidate elements had been cloned into a plasmid upstream of the promoter of a reporter gene [30]. Enhancers from GenoSTAN segmentations showed significantly higher activity than repressed or low coverage regions (GenoSTAN-Poilog-K562 & GenoSTAN-nb-K562: p-value < 0.001, Wilcoxon test, Figure 3E). Interestingly, repressed regions (marked by H3K27me3) showed lower activity than low coverage regions. Moreover, GenoSTAN-Poilog-K562 enhancers showed significantly higher enhancer activity than those of three other studies (Figure 3F), including the original study (p-value < 0.01, ChromHMM-Nature enhancers) by Kheradpour et al. [30]. This analysis shows that GenoSTAN has a higher success rate in predicting in vivo enhancer activity than previous methods.

Comparison of the GenoSTAN, ChromHMM, Segway and EpicSeg algorithms on a common dataset

The K562 genome segmentations of ChromHMM, Segway and EpicSeg used so far were derived from different combinations of data tracks. To verify that the favorable performance of GenoSTAN is mainly due to improved modeling and not due to different data, we also ran ChromHMM, Segway and EpicSeg on the same data as GenoSTAN-Poilog-K562 and GenoSTAN-nb-K562. Both GenoSTAN-Poilog-K562 and GenoSTAN-nb-K562 had a lower FDR at a similar or higher recall than all three other methods (Supplementary Figure 3). Moreover, we found that changing the binarization for ChromHMM dramatically affected its outcome. Without further manual processing of the data, ChromHMM fitted only one transcriptionally active state, which modeled both promoters and enhancers, regardless of state number (Supplementary Figure 3). We suspected that the high read coverage in the H3K4me1 and H3K4me3 signal tracks made promoters and enhancers indistinguishable after binarization (H3K4me1 and H3K4me3 were called present at both promoters and enhancers, and they were both called absent elsewhere). When all data tracks were subsampled to the same (and lower) library size, this problem was solved and ChromHMM fitted multiple transcriptionally active states, thereby distinguishing promoters from enhancers and, at the same time, increasing in accuracy (Supplementary Figure 3). The same problem occurred for Segway, but changing Segway’s parameters did not help distinguish different transcriptionally active chromatin states. To make sure that these results did not depend on the arbitrary choice of the number of states, we ran each method using 10 to 30 states (Methods) and calculated the precision of each state S for recalling HOT (respectively TSS) regions as the fraction of all segments annotated with S that overlapped with a HOT (respectively TSS) region. For each number of states and each segmentation algorithm, we determined the state with highest precision (Supplementary Fig. 4A, B). 
Independently of the number of states, GenoSTAN-Poilog and GenoSTAN-nb consistently performed best. Even at low state numbers, precision remained constantly high, while it decreased considerably for other methods. We also derived an area under curve (AUC) score for each model, to assess the spatial accuracy in calling TSSs or HOT regions (Supplementary Fig. 4C, D and Methods). Again, AUC scores were consistently highest for the GenoSTAN segmentations. Altogether, this extensive benchmark in the K562 cell line demonstrates that GenoSTAN-Poilog and, to a slightly lesser extent, GenoSTAN-nb outperform current chromatin state annotation algorithms for identifying enhancers and promoters.
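
The per-state precision defined above (the fraction of all segments annotated with a state S that overlap a reference region) can be sketched as follows. The sketch assumes half-open coordinates and reference regions that do not overlap each other within a chromosome; the function names and data layout are hypothetical.

```python
from bisect import bisect_right


def build_index(reference):
    """Index half-open (chrom, start, end) reference regions by chromosome.

    Assumes regions on the same chromosome do not overlap each other, so
    that both the start and the end lists are sorted.
    """
    index = {}
    for chrom, start, end in sorted(reference):
        starts, ends = index.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)
    return index


def overlaps_any(index, chrom, start, end):
    """True if [start, end) overlaps at least one indexed region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # First reference region that ends after the segment starts.
    i = bisect_right(ends, start)
    return i < len(starts) and starts[i] < end


def state_precision(segments, reference):
    """Precision per state for (chrom, start, end, state) segments."""
    index = build_index(reference)
    n_total = {}
    n_hit = {}
    for chrom, start, end, state in segments:
        n_total[state] = n_total.get(state, 0) + 1
        if overlaps_any(index, chrom, start, end):
            n_hit[state] = n_hit.get(state, 0) + 1
    return {s: n_hit.get(s, 0) / n_total[s] for s in n_total}


# Toy example with made-up coordinates.
hot = [("chr1", 100, 200), ("chr1", 500, 600)]
segs = [
    ("chr1", 0, 100, "Low"),
    ("chr1", 150, 250, "Prom"),
    ("chr1", 590, 700, "Prom"),
    ("chr1", 300, 400, "Prom"),
    ("chr2", 100, 200, "Enh"),
]
precision = state_precision(segs, hot)
best_state = max(precision, key=precision.get)
print(best_state, precision[best_state])
```

Repeating this for every number of states and keeping the highest-precision state gives the kind of comparison shown in Supplementary Fig. 4A, B.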

Benchmarks II and III: Chromatin state annotation for ENCODE and Roadmap Epigenomics cell types and tissues

We next applied GenoSTAN to 127 cell types and tissues from ENCODE and Roadmap Epigenomics, the largest compendium of chromatin-related data, using genomic input and the five chromatin marks H3K4me1, H3K4me3, H3K36me3, H3K27me3, and H3K9me3 that have been profiled across the whole compendium [12] (Supplementary Figure 5, Benchmark II, Figure 1B for data tracks and model names). Moreover, we performed a dedicated analysis of 20 of these cell types and tissues which had three further important data tracks: H3K27ac, H3K9ac and DNase-Seq (Supplementary Figure 6, Benchmark III, Figure 1B for data tracks and model names). These further three tracks are important features of active promoters and enhancers, which can lead to more precisely mapped enhancer boundaries [13]. We performed similar comparisons as described above to the three available segmentations from the Roadmap Epigenomics project with 15, 18 and 25 states (ChromHMM-15, -18, and -25) [12, 19]. All methods performed worse than in Benchmark I, possibly due to lower read coverage or to less rich data. Nonetheless, the GenoSTAN annotations consistently outperformed the existing ones. Specifically, this held when assessing the recovery of FANTOM5 CAGE tags (Figure 4A, assessed for all 127 cell types and tissues), of GRO-cap TSSs (Figure 4B, assessed for the cell types with available GRO-cap TSSs), and of HOT regions (Figure 4C, assessed for the cell types with available HOT regions). Moreover, both GenoSTAN models distinguished promoters from enhancers better than previous annotations (Figure 4D, Supplementary Table 2). The low accuracy of ChromHMM-15 and ChromHMM-18 promoters might be caused by frequent state switching between the promoter and promoter flanking state (Supplementary Figure 7). Consequently, the number of promoter regions in K562 was up to 30% higher in the ChromHMM-15 and -18 segmentations than in the GenoSTAN or ChromHMM-25 segmentations (Supplementary Table 1). 
The number of predicted enhancers (in K562) also differed greatly. The ChromHMM-15 and -18 state models predict 92,824 (7_Enh) and 22,678 (9_EnhA1) enhancers, while ChromHMM-25 predicts 12,706 and GenoSTAN-Poilog-20 and -127 predict 15,655 and 45,955 enhancers (Supplementary Table 1). Although GenoSTAN predicted more enhancers than the ChromHMM-25 model, the fraction of putative enhancers bound by individual TFs was greater (Figure 4E). For instance, 46% (25%) of enhancers were bound by Pol II in the GenoSTAN-Poilog-20 (-127) model, compared to 8%, 18% and 36% in the ChromHMM 15, 18 and 25 state models. Also, the lineage-specific enhancer-binding transcription factor TAL1 binds at 37% (GenoSTAN-Poilog-20) and 27% (GenoSTAN-Poilog-127) of predicted enhancers. Conversely, 13%, 16% and 27% of putative enhancers were bound by TAL1 in the respective 15, 18 and 25 state ChromHMM models (Figure 4E). Collectively, these results show that the improved performance of GenoSTAN is not restricted to the K562 dataset.

Cell-type specific enrichment of disease- and other complex trait-associated genetic variants at promoters and enhancers

Previous studies showed that disease-associated genetic variants are enriched in potential regulatory regions [12, 20, 58, 57, 54, 42], demonstrating the need for accurate maps of these elements to understand genotype-phenotype relationships and genetic disease. To study the potential impact of variants in regulatory regions on various traits and diseases, we overlapped our enhancer and promoter annotations from 127 cell types and tissues (Benchmark II, Figure 1B) with phenotype-associated genetic variants from the NHGRI genome-wide association studies catalog (NHGRI GWAS Catalog [59]). First, we intersected trait-associated variants with enhancer and promoter states (GenoSTAN-Poilog-127). Overall, 37% of all trait-associated SNPs were located in potential enhancers and 7% in potential promoters. The number of traits significantly enriched (at FDR <0.05) in enhancers or promoters in at least one cell type or tissue was larger for GenoSTAN-Poilog-127 (69 traits for enhancers and 20 traits for promoters) than for the best performing ChromHMM model (ChromHMM-15, 64 traits for enhancers and 18 traits for promoters). The better performance of GenoSTAN-Poilog-127 was found at all FDR cutoffs (Supplementary Figure 8). To control for the fact that methods can differ in the length of the promoters and enhancers they predict, we furthermore computed the recall of GWAS variants at a fixed genomic coverage. When restricting to a total genomic coverage of 2% (by random subsetting, which also allows confidence intervals to be computed; Methods), enhancers of all GenoSTAN models overlapped a higher fraction of GWAS variants at a similar or better per-base-pair density compared to the current ChromHMM annotations (Figure 5A). The same trend was observed for promoters when restricting to 1% of genomic coverage (Figure 5B).
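
The idea of comparing recall at a fixed genomic coverage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure from the Methods: regions are drawn in random order until the coverage budget is reached, recall is the fraction of variants falling into the retained regions, and repeating the draw gives an empirical 95% percentile interval. The function name `recall_at_coverage`, its parameters, and the toy data are hypothetical.

```python
import random

def recall_at_coverage(regions, variants, genome_length, coverage,
                       n_rep=200, seed=0):
    """Mean recall and a 95% percentile interval at a fixed coverage budget.

    regions: list of (start, end) half-open intervals, assumed non-overlapping
    variants: list of variant positions
    """
    rng = random.Random(seed)
    budget = coverage * genome_length  # maximum number of base pairs to keep
    recalls = []
    for _ in range(n_rep):
        shuffled = list(regions)
        rng.shuffle(shuffled)
        kept, used = [], 0
        for start, end in shuffled:
            if used + (end - start) > budget:
                break  # budget exhausted for this random subset
            kept.append((start, end))
            used += end - start
        hits = sum(any(s <= v < e for s, e in kept) for v in variants)
        recalls.append(hits / len(variants))
    recalls.sort()
    mean = sum(recalls) / n_rep
    lower = recalls[int(0.025 * n_rep)]
    upper = recalls[min(n_rep - 1, int(0.975 * n_rep))]
    return mean, lower, upper

# Hypothetical toy genome of 1,000 bp with four regions and five variants
regions = [(0, 100), (200, 300), (500, 600), (800, 900)]
variants = [50, 250, 450, 850, 950]
full = recall_at_coverage(regions, variants, genome_length=1000, coverage=0.5)
half = recall_at_coverage(regions, variants, genome_length=1000, coverage=0.2)
```

With a budget large enough to keep every region (`full`), the recall is the same in each repetition, so the interval collapses to a point. With a 20% budget (`half`), only two regions fit per draw and the recall varies between repetitions.
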
The improved overlap with trait-associated variants indicates that the GenoSTAN annotation has a higher enrichment for functional elements than the current annotations. In accordance with previous studies [12, 20], we found that individual variants were strongly enriched in enhancer or promoter states specifically active in the relevant cell types or tissues (Figure 5C, Supplementary Figure 8C). Variants associated with height were significantly enriched in osteoblasts (at FDR <0.001 here and hereafter, performed on Benchmark II for consistency across cell types and tissues). Variants associated with immune response or autoimmune disorders were enriched in B- and T-cell enhancers (Figure 5C) and promoters (Supplementary Figure 8C). These include, for instance, SNPs associated with HIV-1 control and with autoimmune diseases such as systemic lupus erythematosus, inflammatory bowel disease, ulcerative colitis, rheumatoid arthritis, primary biliary cirrhosis and multiple sclerosis. Variants associated with electrocardiographic traits and QT interval were enriched in fetal heart enhancers. SNPs associated with colorectal cancer were enriched in enhancers specific to the digestive system. These results illustrate that the annotation of potential promoters and enhancers generated in this study can be of great use for interpreting trait-associated genetic variants, and underscore the importance of cell-type or tissue-specific annotations.

A novel annotation of enhancers and promoters in human cell types and tissues

We then compiled the results from the best performing annotations for each cell type and tissue into a single annotation file. The combined annotation file is available as Supplementary Data. All individual chromatin state annotations are available at http://i12g-gagneurweb.in.tum.de/public/paper/GenoSTAN. For the combined annotation file, we chose GenoSTAN with Poisson-lognormal in every instance, as it performed best in almost every comparison we conducted. We used the results from Benchmark I for K562, from Benchmark III for the 20 cell types and tissues with the additional data tracks, and from Benchmark II for all the remaining Roadmap Epigenomics cell types and tissues. Overall, our annotation typically reports between 8,945 and 16,750 (10% and 90% quantiles of the number of promoters across all 127 cell types and tissues) active promoters per cell type or tissue. This number is consistent with the typical number of expressed genes per tissue (in the range of 11,953 to 16,869 [52]). However, the median width of these elements depends on the data on which the annotation was based. For the Benchmark III dataset, promoters are much narrower (800 bp median) than for the K562 annotations (1.4 kb, Benchmark I dataset), suggesting that promoter regions in the 20 cell types more accurately recover DNase hypersensitivity sites (DHS) of the core promoter (Figure 2, Supplementary Figure 6). The number of enhancers per cell type or tissue varied more widely (between 8,208 and 33,596 for the 10% and 90% quantiles). The large variation in the number of enhancers might be partly due to differences in sensitivity in complex biological samples. Consistent with this hypothesis, far fewer enhancers were identified in tissues than in primary cells and cell lines (Supplementary Figure 9), likely because enhancers that are active only in a small subset of the cell types present in a tissue may not be detected. As more cell-type-specific data become available, improved maps can be generated. The GenoSTAN software, which is publicly available, will be instrumental in updating these genomic annotations.

Promoters and enhancers have a distinct TF regulatory landscape

The biochemical distinction between enhancers and promoters is a topic of debate [53, 6]. We explored to what extent enhancers and promoters are differentially bound by TFs using the K562 cell line dataset because i) we obtained the most accurate annotation for this cell line (GenoSTAN-Poilog-K562, Benchmark I) and ii) ChIP-seq data were available for as many as 101 TFs in this cell line [13]. Nine TF modules were defined by clustering based on binding pattern similarity across enhancers and promoters (Methods, Figure 6). These nine TF modules were further characterized by the propensity of their TFs to bind promoters, enhancers or both (Figure 6). In accordance with previous studies [4, 22], this recovered many complexes and promoter-associated and enhancer-associated proteins, including the CTCF/cohesin complex (CTCF, Rad21, SMC3, Znf143), the AP-1 complex (Jun, JunB, FOSL1, FOS), Pol3, promoter- and enhancer-associated modules, and factors associated with chromatin repression (EZH2, HDAC6). Moreover, the modules identified provided insights into the distinction between promoters and enhancers. On the one hand, some TFs are common to both enhancers and promoters, which supports previous reports [5, 6]. In accordance with the recent finding of widespread transcription at enhancers [16], Pol II and the multifunctional TFs Myc, Max, and MAZ [56] are part of a TF module, which we called the Promoter-Enhancer-Module (PEM), that had approximately equal binding preferences for promoter and enhancer states but also co-localized with other TFs specifically binding enhancers or promoters (Figure 6). On the other hand, enhancers and promoters were also bound by distinct TFs, which is consistent with previously reported TF co-occurrence patterns at gene-proximal and gene-distal sites [22, 4].
Among the promoter- and enhancer-associated proteins, we defined Promoter modules 1 and 2 (PM1, PM2) and Enhancer modules 1 and 2 (EM1, EM2), which had a strong preference for binding either promoters or enhancers but exhibited different co-binding rates (Figure 6). Promoter module 1 contained TFs which were specifically enriched in promoter states and associated with basic promoter functions, such as chromatin remodeling (CHD1, CHD2), transcription initiation or elongation (TBP, TAF1, CCNT2, SP1) and other TFs involved in the regulation of specific gene classes (e.g. cell cycle: E2F4) [56]. However, it also included TFs known as transcriptional repressors (e.g. Mxi1, a potential tumor suppressor, which negatively regulates Myc). While TFs in PM1 showed a high co-binding rate, PM2 factors exhibited low co-binding. This might be partially explained by lower efficiency of the ChIP, since PM2 also contained general TFs such as TFIIB, TFIIF or the Serine 2 phospho-isoform of Pol II, which are expected to co-localize with other general TFs from PM1. EM1 contained TFs with a high co-binding rate, which included TAL1, an important lineage-specific regulator of erythroid development (K562 are erythroleukemia cells) that had been shown to interact with CEBPB, GATA1 and GATA2 at gene-distal loci [22, 47]. It also contained the enhancer-specific transcription factor P300 [55] as well as transcriptional activators (e.g. ATF1) and repressors (e.g. HDAC2, REST) [56]. Analogously to PM2, EM2 contained enhancer-specific transcriptional activators and repressors with a low co-binding rate. Altogether, this analysis highlights the common and distinctive TF binding properties of enhancers and promoters.
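
Two quantities recur in this analysis: how strongly a TF prefers promoters over enhancers, and how often two TFs bind the same regions. The Python sketch below shows one simple way to compute such quantities from sets of bound regions. It is an assumption-laden illustration, not the clustering procedure described in the Methods: the Jaccard index as a co-binding measure, the preference score, the function names, and the placeholder factors `TF_A` to `TF_D` with their toy binding sets are all hypothetical.

```python
def jaccard(a, b):
    """Co-binding of two TFs: shared bound regions over all bound regions."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def promoter_preference(bound, promoters, enhancers):
    """Score in [0, 1]: 1 = promoters only, 0 = enhancers only, 0.5 = no preference."""
    frac_prom = len(bound & promoters) / len(promoters)
    frac_enh = len(bound & enhancers) / len(enhancers)
    total = frac_prom + frac_enh
    return frac_prom / total if total else 0.5

# Hypothetical region identifiers and binding sets (not real data)
promoters = {"p1", "p2", "p3", "p4"}
enhancers = {"e1", "e2", "e3", "e4"}
binding = {
    "TF_A": {"p1", "p2", "p3"},
    "TF_B": {"p1", "p2", "p3", "p4"},
    "TF_C": {"e1", "e2"},
    "TF_D": {"p1", "p2", "e1", "e2"},
}

preference = {tf: promoter_preference(regions, promoters, enhancers)
              for tf, regions in binding.items()}
co_binding_ab = jaccard(binding["TF_A"], binding["TF_B"])
```

In this toy setting `TF_A` and `TF_B` behave like members of a promoter module with high co-binding, `TF_C` is enhancer-specific, and `TF_D` binds both element types equally, as a PEM-like factor would.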

Discussion

We introduced GenoSTAN, a method for de novo and unbiased inference of chromatin states from genome-wide profiling data. In contrast to previously described methods for chromatin state annotation, GenoSTAN directly models read counts, thus avoiding data transformation and the manual tuning of thresholds (as in ChromHMM and Segway), and variance is not shared between data tracks or states (as in EpicSeg and Segway) [43, 17, 26]. GenoSTAN is released as part of the open-source R/Bioconductor package STAN [62, 28, 21], which provides a fast, multiprocessing implementation that can process data from 127 human cell types in 3-6 days (GenoSTAN-Poilog-127: 6 days, -nb: 3 days). Application of GenoSTAN significantly improved chromatin state maps of 127 cell types and tissues from the ENCODE and Roadmap Epigenomics projects [13, 12]. Binding of the enhancer-associated co-activator CBP and the histone acetyltransferase P300 was used by several studies for the genome-wide prediction of enhancers [55, 45, 2]. From these predictions, a distinctive chromatin signature for promoters and enhancers was derived based on H3K4me1 and H3K4me3 [2]. In particular, the ratio H3K4me1/H3K4me3 was found to be low at promoters in comparison to enhancers. Active and poised enhancers could also be distinguished by presence or absence of H3K27me3 and H3K9me3 [63]. All these features could be confirmed by GenoSTAN, making it a promising tool for the biochemical characterization of enhancers and promoters. Moreover, extensive benchmarks based on independent data including transcriptional activity, TF binding, cis-regulatory activity, and enrichment for complex trait-associated variants showed that GenoSTAN annotations are more accurate than those of former genome segmentation methods. The GenoSTAN annotation sheds light on the common and distinctive features of promoters and enhancers, which are currently a subject of intense debate [53, 6].
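
To make concrete what modeling read counts directly with a Poisson-lognormal distribution means, the following Python sketch evaluates its probability mass function by numerical integration over the latent log-rate. It is a didactic approximation, not the implementation in the STAN package: the function name `poilog_pmf`, the trapezoidal grid, and the parameter values are assumptions chosen for illustration.

```python
import math

def poilog_pmf(k, mu, sigma, n_grid=801, z_max=8.0):
    """P(X = k) for X | lambda ~ Poisson(lambda), log(lambda) ~ Normal(mu, sigma^2).

    The integral over the latent log-rate is approximated with the
    trapezoidal rule on a grid of standard-normal values z.
    """
    step = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        z = -z_max + i * step
        log_lam = mu + sigma * z
        # Poisson log-probability of k given rate exp(log_lam)
        log_pois = k * log_lam - math.exp(log_lam) - math.lgamma(k + 1)
        # standard normal log-density of z
        log_norm = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * math.exp(log_pois + log_norm)
    return total * step

# Hypothetical parameters for one chromatin state and one data track
mu, sigma = 1.0, 0.5
pmf = [poilog_pmf(k, mu, sigma) for k in range(201)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
```

The mean of this distribution is exp(mu + sigma^2/2), and its variance exceeds the mean, which is the overdispersion that a plain Poisson model of read counts cannot capture.
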
Among other characteristics, a shared architecture of promoters and enhancers was proposed based on the recent discovery of widespread bidirectional transcription at enhancers [31, 6, 16]. This was supported by the observation that enhancers, which are depleted in CpG islands, have similar transcription factor (TF) motif enrichments as CpG-poor promoters [5]. However, other studies showed that TF co-occurrence differed between gene-proximal and gene-distal sites [22, 4]. GenoSTAN chromatin states revealed a very distinct TF regulatory landscape of these elements and therefore suggest that promoters and enhancers are fundamentally different regulatory elements, both sharing the binding of the core transcriptional machinery. Our annotation of enhancers and promoters will be a valuable resource to help characterize the genomic context of the binding of further TFs. Indirectly, our analysis showed that chromatin state annotations are better predictors of enhancers than the transcription-based definition provided by the FANTOM5 consortium [5]. While FANTOM5 enhancers are an accurate predictor of transcriptionally active enhancers, their sensitivity remains poor: for K562 cells, only 4,263 enhancers were called by overlap with GRO-cap TSSs and DHS, which is less than the estimated number of transcribed genes, compared to about 20,000-30,000 for ChromHMM and 10,000-20,000 for GenoSTAN. Although the sensitivity of the transcription-based approach can increase with transient transcriptome profiling [48, 46] or nascent transcriptome profiling [11], the chromatin state data undoubtedly add valuable information for the identification of promoters and enhancers. Because it models count data, GenoSTAN analysis can in principle also integrate RNA-seq profiles, for instance by using it in a strand-specific fashion [62]. Systematic identification of cis-regulatory active elements by direct activity assays is notoriously difficult. STARR-Seq, for instance, is a high-throughput reporter assay for the de novo identification of enhancers [7]. It was previously used to identify thousands of cell-type-specific enhancers in Drosophila, but has not been applied to human yet. Moreover, STARR-Seq makes rigid assumptions about the location of the enhancer element with respect to the promoter, and it does not account for the native chromatin structure. It might therefore identify regions that are inactive in situ [7]. Other experimental assays for the validation of predicted ENCODE enhancers led to different results [35, 30]. Complementary to these approaches, the systematic evaluation of cis-regulatory activity based on candidate regions in human cells has made progress with the advent of high-throughput CRISPR perturbation assays [50].
Because they require candidate cis-regulatory regions in the first place, such approaches will benefit from improved annotation maps such as the one we are providing. Thus, we foresee GenoSTAN to be instrumental in future efforts to generate robust, genome-wide maps of functional genomic regions like promoters and enhancers.

References

[1] The blueprint project. http://www.blueprint-epigenome.eu/.

[2] Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature genetics, 39(3):311–8, 2007.

[3] Metazoan promoters: emerging characteristics and insights into transcriptional regulation. Nat Rev Genet, 13(4):233–245, 2012.

[4] Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome Research, 23:1142–1154, 2013.

[5] R. Andersson, C. Gebhard, and I. et al. Miguel-Escalada. An atlas of active enhancers across human cell types and tissues. Nature, 507(7493):455–461, Mar 2014.

[6] Robin Andersson. Promoter or enhancer, what’s the difference? Deconstruction of established distinctions and presentation of a unifying model. BioEssays, 37(3):314–323, 2015.

[7] Cosmas D Arnold, Daniel Gerlach, Christoph Stelzer, Lukasz M Boryn, Martina Rath, and Alexander Stark. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science (New York, N.Y.), 339(6123):1074–1077, 2013.

[8] J Banerji, S Rusconi, and W Schaffner. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell, 27(2 Pt 1):299–308, 1981.

[9] E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. Gingeras, E. H. Margulies, Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M. Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum, R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clelland, S. Davis, N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy, M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson, T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri, S. C. Parker, P. J. Sabo, R. Sandstrom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox, M. Yu, F. S. Collins, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev, W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky, D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. Sandelin, I. L. Hofacker, R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sekinger, J. Lagarde, J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. Lindemeyer, K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd, R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T. Weirauch, J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Srinivasan, W. K. Sung, H. S. Ooi, K. P. Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia, S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano, C. Wyss, E. Cheung, T. G. Clark, J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai, J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel, J. S. Mattick, P. Carninci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers, J. Rogers, P. F. Stadler, T. M. Lowe, C. L. Wei, Y. Ruan, K. Struhl, M. Gerstein, S. E. Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A. Wetterstrand, P. J. Good, E. A. Feingold, M. S. Guyer, G. M. Cooper, G. Asimenos, C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan, F. Pardi, T. Massingham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullikin, A. Ureta-Vidal, B. Paten, M. Seringhaus, D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone, S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D. Trinklein, Z. D. Zhang, L. Barrera, R. Stuart, D. C. King, A. Ameur, S. Enroth, M. C. Bieda, J. Kim, A. A. Bhinge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. Lee, P. Ng, A. Shahab, A. Yang, Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley, D. Inman, M. A. Singer, T. A. Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman, J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F. Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar, N. Heintzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld, S. F. Aldred, S. J. Cooper, A. Halees, J. M. Lin, H. P. Shulha, X. Zhang, M. Xu, J. N. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green, C. Wadelius, P. J. Farnham, B. Ren, R. A. Harte, A. S. Hinrichs, H. Trumbower, H. Clawson, J. Hillman-Jackson, A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol, C. P. Bird, P. I. de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Martin, B. E. Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert, M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X. Guan, N. F. Hansen, J. R. Idol, V. V. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C. Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley, H. Jiang, G. M. Weinstock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K. Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. Lindblad-Toh, E. S. Lander, M. Koriabine, M. Nefedov, K. Osoegawa, Y. Yoshinaga, B. Zhu, and P. J. de Jong. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, Jun 2007.

[10] M. G. Bulmer. On Fitting the Poisson Lognormal Distribution to Species-Abundance Data. Biometrics, 30(1):101–110, 1974.

[11] L. S. Churchman and J. S. Weissman. Nascent transcript sequencing visualizes transcription at nucleotide resolution. Nature, 469(7330):368–373, Jan 2011.

[12] Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.

[13] The ENCODE Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57–74, 2012.

[14] The Fantom Consortium. A promoter-level mammalian expression atlas. Nature, 507(7493):462–470, 2014.

[15] John D Cook. Notes on the Negative Binomial Distribution, 2009.

[16] Leighton J. Core, André L. Martins, Charles G. Danko, Colin T Waters, Adam Siepel, and John T. Lis. Analysis of transcription start sites from nascent RNA supports a unified architecture of mammalian promoters and enhancers. Nature Genetics, 46(12):1311–1320, 2014.

[17] Jason Ernst and Manolis Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature biotechnology, 28(8):817–825, 2010.

[18] Jason Ernst and Manolis Kellis. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods, 9(3):215–216, 2012.

[19] Jason Ernst and Manolis Kellis. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nature Biotechnology, 33(4):364–376, 2015.

[20] Jason Ernst, Pouya Kheradpour, Tarjei S Mikkelsen, Noam Shoresh, Lucas D Ward, Charles B Epstein, Xiaolan Zhang, Li Wang, Robbyn Issner, Michael Coyne, Manching Ku, Timothy Durham, Manolis Kellis, and Bradley E Bernstein. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43–49, 2011.

[21] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. Yang, and J. Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5(10):R80, 2004.

[22] Mark B. Gerstein, Anshul Kundaje, Manoj Hariharan, Sherman M. Weissman, and Michael Snyder. Architecture of the human regulatory network derived from ENCODE data. Nature, 489(7414):91–100, 2012.

[23] Vidar Grøtan and Steinar Engen. poilog: Poisson lognormal and bivariate Poisson lognormal distribution, 2008. R package version 0.4.

[24] J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes, A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan, G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin, C. Howald, A. Tanzer, T. Derrien, J. Chrast, N. Walters, S. Balasubramanian, B. Pei, M. Tress, J. M. Rodriguez, I. Ezkurdia, J. van Baren, M. Brent, D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigo, and T. J. Hubbard. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res., 22(9):1760–1774, Sep 2012.

[25] Sven Heinz, Casey E Romanoski, Christopher Benner, and Christopher K Glass. The selection and function of cell type-specific enhancers. Nature Reviews Molecular Cell Biology, 16(3):144–154, 2015.

[26] Michael M Hoffman, Orion J Buske, Jie Wang, Zhiping Weng, Jeff A Bilmes, and William Stafford Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods, 9(5):473–476, 2012.

[27] Michael M. Hoffman, Jason Ernst, Steven P. Wilder, Anshul Kundaje, Robert S. Harris, Max Libbrecht, Belinda Giardine, Paul M. Ellenbogen, Jeffrey A. Bilmes, Ewan Birney, Ross C. Hardison, Ian Dunham, Manolis Kellis, and William Stafford Noble. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Research, 41(2):827–841, 2013.

[28] Ross Ihaka and Robert Gentleman. R: A Language for Data Analysis and Graphics. J. Comp. Graph. Stat., 5(3):299–314, 1996.

[29] Peter V Kharchenko, Artyom A Alekseyenko, Yuri B Schwartz, Aki Minoda, Nicole C Riddle, Jason Ernst, Peter J Sabo, Erica Larschan, Andrey A Gorchakov, Tingting Gu, Daniela Linder-Basso, Annette Plachetka, Gregory Shanower, Michael Y Tolstorukov, Lovelace J Luquette, Ruibin Xi, Youngsook L Jung, Richard W Park, Eric P Bishop, Theresa K Canfield, Richard Sandstrom, Robert E Thurman, David M MacAlpine, John A Stamatoyannopoulos, Manolis Kellis, Sarah C R Elgin, Mitzi I Kuroda, Vincenzo Pirrotta, Gary H Karpen, and Peter J Park. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature, 471(7339):480–485, 2011.

[30] Pouya Kheradpour, Jason Ernst, Alexandre Melnikov, Peter Rogov, Li Wang, Xiaolan Zhang, Jessica Alston, Tarjei S. Mikkelsen, and Manolis Kellis. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Research, 23:800–811, 2013.

[31] Tae-Kyung Kim, Martin Hemberg, Jesse M. Gray, Allen M. Costa, Daniel M. Bear, Jing Wu, David A. Harmin, Mike Laptewicz, Kellie Barbara-Haley, Scott Kuersten, Eirene Markenscoff-Papadimitriou, Dietmar Kuhl, Haruhiko Bito, Paul F. Worley, Gabriel Kreiman, and Michael E. Greenberg. Widespread transcription at neuronal activity-regulated enhancers. Nature, 465(7295):182–187, 2010.

[32] D. Kleftogiannis, P. Kalnis, and V. B. Bajic. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Research, 43(1):e6–e6, 2015.

[33] Dimitrios Kleftogiannis, Panos Kalnis, and Vladimir B Bajic. Progress and challenges in bioinformatics approaches for enhancer identification. Briefings in Bioinformatics, (August):1–13, 2015.

[34] E. Z. Kvon, G. Stampfel, J. O. Yanez-Cuna, B. J. Dickson, and A. Stark. HOT regions function as patterned developmental enhancers and have a distinct cis-regulatory signature. Genes Dev., 26(9):908–913, May 2012.

[35] J. C. Kwasnieski, C. Fiore, H. G. Chaudhari, and B. A. Cohen. High-throughput functional testing of ENCODE segmentation predictions. Genome Research, pages gr.173518.114, 2014.

[36] B. Langmead and S. L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9(4):357–359, Apr 2012.

[37] Michael Lawrence, Robert Gentleman, and Vincent Carey. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics, 25:1841–1842, 2009.

[38] Michael Lawrence, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin Morgan, and Vincent Carey. Software for computing and annotating genomic ranges. PLoS Computational Biology, 9, 2013.

[39] Dongwon Lee, Rachel Karchin, and Michael A Beer. Discriminative prediction of mammalian enhancers from DNA sequence. Genome research, 21(12):2167–80, 2011.

[40] H. Li, H. Chen, F. Liu, C. Ren, S. Wang, X. Bo, and W. Shu. Functional annotation of HOT regions in the human genome: implications for human disease and cancer. Sci Rep, 5:11633, 2015.

[41] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, Aug 2009.

[42] Kerstin Lindblad-Toh, Manuel Garber, Or Zuk, Michael F. Lin, Brian J. Parker, Stefan Washietl, Pouya Kheradpour, Jason Ernst, Gregory Jordan, Evan Mauceli, Lucas D. Ward, Craig B. Lowe, Alisha K. Holloway, Michele Clamp, Sante Gnerre, Jessica Alfoldi, Kathryn Beal, Jean Chang, Hiram Clawson, James Cuff, Federica Di Palma, Stephen Fitzgerald, Paul Flicek, Mitchell Guttman, Melissa J. Hubisz, David B. Jaffe, Irwin Jungreis, W. James Kent, Dennis Kostka, Marcia Lara, Andre L. Martins, Tim Massingham, Ida Moltke, Brian J. Raney, Matthew D. Rasmussen, Jim Robinson, Alexander Stark, Albert J. Vilella, Jiayu Wen, Xiaohui Xie, Michael C. Zody, Jen Baldwin, Toby Bloom, Chee Whye Chin, Dave Heiman, Robert Nicol, Chad Nusbaum, Sarah Young, Jane Wilkinson, Kim C. Worley, Christie L. Kovar, Donna M. Muzny, Richard A. Gibbs, Andrew Cree, Huyen H. Dihn, Gerald Fowler, Shalili Jhangiani, Vandita Joshi, Sandra Lee, Lora R. Lewis, Lynne V. Nazareth, Geoffrey Okwuonu, Jireh Santibanez, Wesley C. Warren, Elaine R. Mardis, George M. Weinstock, Richard K. Wilson, Kim Delehaunty, David Dooling, Catrina Fronik, Lucinda Fulton, Bob Fulton, Tina Graves, Patrick Minx, Erica Sodergren, Ewan Birney, Elliott H. Margulies, Javier Herrero, Eric D. Green, David Haussler, Adam Siepel, Nick Goldman, Katherine S. Pollard, Jakob S. Pedersen, Eric S. Lander, and Manolis Kellis. A high-resolution map of human evolutionary constraint using 29 mammals. Nature, 478(7370):476–482, 2011.

[43] Alessandro Mammana and Ho-Ryun Chung. Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome. Genome Biology, 16(1):151, 2015.

[44] Guillemette Marot, David Castel, Jordi Estelle, Gregory Guernec, Bernd Jagla, Nicolas Servant, Luc Jouneau, Denis Laloe, Caroline Le Gall, and Brigitte Schae. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. 14(6), 2012.

[45] Dalit May, Matthew J Blow, Tommy Kaplan, David J McCulley, Brian C Jensen, Jennifer A Akiyama, Amy Holt, Ingrid Plajzer-Frick, Malak Shoukry, Crystal Wright, Veena Afzal, Paul C Simpson, Edward M Rubin, Brian L Black, James Bristow, Len A Pennacchio, and Axel Visel. Large-scale discovery of enhancers from human heart tissue. Nature genetics, 44(1):89–93, 2012.

[46] C. Miller, B. Schwalb, K. Maier, D. Schulz, S. Dumcke, B. Zacher, A. Mayer, J. Sydow, L. Marcinowski, L. Dolken, D. E. Martin, A. Tresch, and P. Cramer. Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeast. Mol. Syst. Biol., 7:458, Jan 2011.

[47] Tõnis Org, Dan Duan, Roberto Ferrari, Amelie Montel-hagen, Ben Van Handel, A Marc, Rajkumar Sasidharan, Liudmilla Rubbi, Yuko Fujiwara, Matteo Pellegrini, Stuart H Orkin, Siavash K Kurdistani, and Hanna K A Mikkola. Scl binds to primed enhancers in mesoderm to regulate hematopoietic and cardiac fate divergence. (January):1–20, 2015.

[48] M. Rabani, R. Raychowdhury, M. Jovanovic, M. Rooney, D. J. Stumpo, A. Pauli, N. Hacohen, A. F. Schier, P. J. Blackshear, N. Friedman, I. Amit, and A. Regev. High-resolution sequencing and modeling identifies distinct dynamic RNA regulatory strategies. Cell, 159(7):1698–1710, Dec 2014.

[49] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, February 1989.

[50] N. Rajagopal, S. Srinivasan, K. Kooshesh, Y. Guo, M. D. Edwards, B. Banerjee, T. Syed, B. J. Emons, D. K. Gifford, and R. I. Sherwood. High-throughput mapping of regulatory DNA. Nat. Biotechnol., 34(2):167–174, Feb 2016.

[51] Nisha Rajagopal, Wei Xie, Yan Li, Uli Wagner, Wei Wang, John Stamatoyannopoulos, Jason Ernst, Manolis Kellis, and Bing Ren. RFECS: a random-forest based algorithm for enhancer identification from chromatin state. PLoS computational biology, 9(3):e1002968, 2013.

[52] D. Ramskold, E. T. Wang, C. B. Burge, and R. Sandberg. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol., 5(12):e1000598, Dec 2009.

[53] Daria Shlyueva, Gerald Stampfel, and Alexander Stark. Transcriptional enhancers: from properties to genome-wide predictions. Nature reviews. Genetics, 15(4):272–86, 2014.

[54] Gosia Trynka, Cynthia Sandor, Buhm Han, Han Xu, Barbara E Stranger, X Shirley Liu, and Soumya Raychaudhuri. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature Genetics, 45(2):124–130, 2012.

[55] Axel Visel, Matthew J. Blow, Zirong Li, Tao Zhang, Jennifer A. Akiyama, Amy Holt, Ingrid Plajzer-Frick, Malak Shoukry, Crystal Wright, Feng Chen, Veena Afzal, Bing Ren, Edward M. Rubin, and Len A. Pennacchio. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457(7231):854–858, 2009.

[56] J. Wang, J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield, M. C. Greven, B. G. Pierce, X. Dong, A. Kundaje, Y. Cheng, O. J. Rando, E. Birney, R. M. Myers, W. S. Noble, M. Snyder, and Z. Weng. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res., 22(9):1798–1812, Sep 2012.

[57] Lucas D. Ward and Manolis Kellis. HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Research, page gkv1340, 2015.

[58] Nils Weinhold, Anders Jacobsen, Nikolaus Schultz, Chris Sander, and William Lee. Genome-wide analysis of noncoding regulatory mutations in cancer. Nature Genetics, 46(11):1160–5, 2014.

[59] D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, and H. Parkinson. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 42(D1):D1001–D1006, 2014.

[60] Kyoung-Jae Won, Xian Zhang, Tao Wang, Bo Ding, Debasish Raha, Michael Snyder, Bing Ren, and Wei Wang. Comparative annotation of functional regions in the human genome using epigenomic data. Nucleic acids research, 41(8):4423–32, 2013.

[61] Kevin Y Yip, Chao Cheng, Nitin Bhardwaj, James B Brown, Jing Leng, Anshul Kundaje, Joel Rozowsky, Ewan Birney, Peter Bickel, Michael Snyder, and Mark Gerstein. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biology, 13(9):R48, 2012.

[62] B. Zacher, M. Lidschreiber, P. Cramer, J. Gagneur, and A. Tresch. Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle. Molecular Systems Biology, 10(12):768–768, 2014.

[63] Gabriel E Zentner, Paul J Tesar, and Peter C Scacheri. Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome research, 21(8):1273–83, 2011.

[64] Daniel R Zerbino, Steven P Wilder, Nathan Johnson, Thomas Juettemann, and Paul R Flicek. The Ensembl Regulatory Build. Genome Biology, 16(1):1–8, 2015.

Methods

Availability of GenoSTAN and chromatin state annotations

GenoSTAN is freely available from http://bioconductor.org/ as part of our previously published R/Bioconductor package STAN [62]. Chromatin state annotations for benchmark I, II and III can be downloaded from http://i12g-gagneurweb.in.tum.de/public/paper/GenoSTAN. The combined promoter and enhancer annotation is available as Supplementary Data.

Motivation of Poisson-lognormal and negative binomial emissions

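The Poisson-lognormal and the negative binomial distribution can be thought of as extensions of the Poisson distribution that allow for greater variance. We now motivate both distributions from a Poisson distribution with a prior on its mean. Suppose that X ∼ Poisson(x|Λ) is a Poisson random variable and Λ ∼ Gamma(λ|α, β), with shape α and scale β. From this we can derive the negative binomial with success rate p and size r [15]:

$$
\begin{aligned}
\Pr(X = x \mid \alpha, \beta) &= \int_0^\infty \mathrm{Poisson}(x \mid \lambda)\, \mathrm{Gamma}\!\left(\lambda \,\middle|\, \alpha = r,\ \beta = \frac{p}{1-p}\right) d\lambda \\
&= \int_0^\infty \frac{\lambda^{x}}{x!}\, e^{-\lambda} \cdot \frac{\lambda^{r-1}\, e^{-\lambda \frac{1-p}{p}}}{\left(\frac{p}{1-p}\right)^{r} \Gamma(r)}\, d\lambda \\
&= \frac{\Gamma(r + x)}{x!\, \Gamma(r)}\, p^{x} (1 - p)^{r}, \quad \text{where } r > 0,\ p \in [0, 1]
\end{aligned}
$$

In order to increase interpretability in the context of read counts, we re-parameterize this with mean µ = rp/(1 − p), so that 1 − p = r/(r + µ):

$$
\Pr(X = x \mid \mu, r) = \frac{\Gamma(r + x)}{x!\, \Gamma(r)} \left(\frac{r}{r + \mu}\right)^{r} \left(1 - \frac{r}{r + \mu}\right)^{x}, \quad \text{where } \mu > 0
$$

The Poisson-lognormal distribution can be motivated likewise. Assume that X ∼ Poisson(x|Λ) is a Poisson random variable and Λ ∼ N(log(λ)|µ, σ), i.e. log(Λ) is normally distributed with mean µ and standard deviation σ. Then the Poisson-lognormal is given by [10]:

$$
\begin{aligned}
\Pr(X = x \mid \mu, \sigma) &= \int_0^\infty \mathrm{Poisson}(x \mid \lambda)\, \mathcal{N}(\log(\lambda) \mid \mu, \sigma)\, d\lambda \\
&= \frac{1}{x!\, \sqrt{2\pi\sigma^{2}}} \int_0^\infty \lambda^{x-1}\, e^{-\lambda}\, e^{-\frac{(\log(\lambda) - \mu)^{2}}{2\sigma^{2}}}\, d\lambda
\end{aligned}
$$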

A closed form solution for this distribution does not exist. Thus numerical integration is needed to calculate probabilities, which is done in GenoSTAN by using the R package poilog [23, 28].
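As an illustration of the two emission families, the following Python sketch evaluates the negative binomial in the mean/size parameterization and approximates the Poisson-lognormal probability by numerical integration. It is not the GenoSTAN implementation (which relies on the R package poilog); the function names `nb_pmf` and `poilog_pmf`, the trapezoidal rule on z = log(λ), and the grid settings `n_grid` and `width` are assumptions made for this example.

```python
import math


def nb_pmf(x, mu, r):
    """Negative binomial pmf with mean mu > 0 and size r > 0."""
    q = r / (r + mu)  # corresponds to r / (r + mu) in the text
    log_p = (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
             + r * math.log(q) + x * math.log1p(-q))
    return math.exp(log_p)


def poilog_pmf(x, mu, sigma, n_grid=2001, width=10.0):
    """Poisson-lognormal pmf, integrated numerically over z = log(lambda).

    With lambda = exp(z), the integrand becomes
    Poisson(x | exp(z)) * Normal(z | mu, sigma); the trapezoidal rule is
    applied on [mu - width * sigma, mu + width * sigma].
    """
    lo = mu - width * sigma
    hi = mu + width * sigma
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        z = lo + i * h
        log_f = (x * z - math.exp(z) - math.lgamma(x + 1)
                 - 0.5 * ((z - mu) / sigma) ** 2
                 - math.log(sigma * math.sqrt(2.0 * math.pi)))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * math.exp(log_f)
    return total * h
```

Both functions work on the log scale before exponentiating, which keeps the Gamma-function terms from overflowing for large counts. The fixed integration window is only adequate for moderate values of µ and σ.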

Optimization of Poisson-lognormal and negative binomial emissions

Let $O = (o_0, ..., o_T)$, $o_t = (o_{t,d})_{d \in D} \in \mathbb{N}_0^{D}$, be an observational sequence of |D|-dimensional count vectors $o_t$. An HMM assumes that each observation $o_t$ is emitted by a corresponding hidden (unobserved) variable $s_t$, t = 0, ..., T. A hidden variable can assume values from a finite set of states K. Each state k ∈ K is associated with an emission distribution $\psi_k$, which defines the probability of making a certain observation, $\psi_k(o_t)$. GenoSTAN assumes that the components $o_{t,d}$, d ∈ D, of a single observation $o_t$ are independent, and hence $\psi_k(o_t) = \prod_{d \in D} \psi_{k,d}(o_{t,d})$. The value of $s_t$ determines the probability of observing $o_t$ by $\Pr(o_t \mid s_t) = \psi_{s_t}(o_t)$. HMM learning is carried out using the Baum-Welch algorithm [49]. The optimization problem for the parameters of a single emission distribution $\psi_{i,d}$ can be written as

$$
\underset{\psi_{i,d}}{\arg\max} \sum_{t=0}^{T} \Pr(s_t = i \mid O)\, \log \psi_{i,d}(o_{t,d}),
$$

where $\Pr(s_t = i \mid O)$ is calculated efficiently by the Forward-Backward algorithm, and the objective is maximized over $\psi_{i,d}$ within the class of negative binomial or Poisson-lognormal distributions. An analytical solution for this problem does not exist. Thus, we resort to numerical optimization. As indicated by [43], the above formula can be very costly to compute, since the function needs to evaluate a sum over the complete observation sequence (i.e. the complete binned genome) in each iteration. However, computations are greatly simplified by grouping together observations $o_{t,d}$ with the same count number. Let $C_d$ be the set of unique counts in dimension d. Then the following terms can be precomputed for all $c \in C_d$ before optimization:

$$
f(c) = \sum_{t;\ o_{t,d} = c} \Pr(s_t = i \mid O)
$$

The objective function becomes

$$
\underset{\psi_{i,d}}{\arg\max} \sum_{c \in C_d} f(c)\, \log \psi_{i,d}(c)
$$

which avoids redundant calculations of $\psi_{i,d}(o_{t,d})$, t = 0, ..., T, and greatly reduces complexity since $|C_d| \ll T$.
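The grouping of observations by count can be sketched in a few lines of Python. This is a minimal illustration, not the GenoSTAN code: the helper names are hypothetical, and a plain Poisson log-probability stands in for the negative binomial or Poisson-lognormal emission so that the example stays self-contained.

```python
import math
from collections import defaultdict


def poisson_logpmf(c, lam):
    """Stand-in emission log-probability (assumption made for this example)."""
    return c * math.log(lam) - lam - math.lgamma(c + 1)


def naive_objective(posteriors, counts, logpmf):
    """Sum over all positions t of Pr(s_t = i | O) * log psi(o_{t,d})."""
    return sum(g * logpmf(c) for g, c in zip(posteriors, counts))


def group_posteriors(posteriors, counts):
    """f(c): total posterior mass of all positions whose count equals c."""
    f = defaultdict(float)
    for g, c in zip(posteriors, counts):
        f[c] += g
    return dict(f)


def grouped_objective(f, logpmf):
    """Sum over unique counts c of f(c) * log psi(c)."""
    return sum(weight * logpmf(c) for c, weight in f.items())


# Toy data: one track, ten bins, posterior probabilities of state i.
counts = [0, 0, 1, 3, 0, 1, 7, 0, 3, 1]
posteriors = [0.1, 0.2, 0.7, 0.9, 0.05, 0.6, 0.95, 0.1, 0.8, 0.5]
f = group_posteriors(posteriors, counts)
```

The table `f` is computed once per EM iteration. Every evaluation of the objective inside the numerical optimizer then touches only the unique counts instead of all T + 1 bins.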

Correction for library size

The sequencing depth can be very different between experiments. GenoSTAN addresses this problem by using pre-computed scaling factors to correct for varying sequencing depths for a data track between cell types. In this work, the ‘total count’ method is used [44]. Let L be the set of cell types and $r_{d,l}$ the number of reads of data track d ∈ D in cell line l ∈ L. The scaling factor is then computed as

$$
s_{d,l} = \frac{r_{d,l}}{\sum_{k \in L} r_{d,k}} \cdot \frac{\sum_{k \in C} r_{d,k}}{|L|}
$$

The probability of an observation $o_{t,l}$ is calculated as $\Pr\left(o_{t,l} \,\middle|\, \frac{\mu}{s_{d,l}}, r\right)$ in the case of negative binomial and $\Pr\left(o_{t,l} \,\middle|\, \log\frac{\mu}{s_{d,l}}, \sigma\right)$ in the case of Poisson-lognormal emissions.

Model initialization

Initialization of model parameters is crucial for HMMs since the EM algorithm is an iterative ascent method which converges to a local maximum. K-means is a widely used approach to derive an initial clustering to estimate model parameters [49]. In order to make this approach applicable to sequencing data, we added a pseudocount and log-transformed the data before k-means clustering. However, without further processing k-means rarely converged and the procedure was slow on the complete data set. To address these issues, we further processed and filtered the data. First, a threshold for signal enrichment for each data track is calculated using the default binarization approach of ChromHMM [17]. The threshold is the smallest discrete number $n_d > 0$ such that $\Pr(X > n_d) < 10^{-4}$, where X is a Poisson random variable with mean $\lambda_d = \frac{\sum_{t=0}^{T} o_{t,d}}{T + 1}$. All $o_{t,d} < n_d$ were set to 0, which improved convergence of k-means. To improve the speed, all genomic bins t with $o_{t,d} = 0$ for all d ∈ D were removed and defined as a ‘background cluster’. K-means was then run on the rest of the data with |K| − 1 clusters. This clustering (the ‘background’ and k-means clusters) was then used to derive an initial estimate of the emission function parameters. Initial state and transition probabilities were initialized uniformly.
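The thresholding and filtering steps that precede k-means can be sketched as follows. This is a hedged illustration in Python, not the package code: the function names, the pseudocount of 1 and the choice to log-transform after thresholding are assumptions, and the k-means step itself is left out.

```python
import math


def poisson_threshold(lam, alpha=1e-4):
    """Smallest integer n > 0 with Pr(X > n) < alpha for X ~ Poisson(lam)."""
    if lam <= 0:
        return 1
    cdf = math.exp(-lam)  # Pr(X <= 0)
    n = 0
    while True:
        n += 1
        # Add Pr(X = n), computed on the log scale for numerical stability.
        cdf += math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))
        if 1.0 - cdf < alpha:
            return n


def prepare_for_kmeans(counts, pseudocount=1.0, alpha=1e-4):
    """Threshold each track, drop all-zero bins, log-transform the rest.

    counts: list of bins, each a list with one count per data track.
    Returns (thresholds, indices of kept bins, log-transformed features);
    the bins that are not kept form the background cluster.
    """
    n_bins = len(counts)
    n_tracks = len(counts[0])
    thresholds = [
        poisson_threshold(sum(row[d] for row in counts) / n_bins, alpha)
        for d in range(n_tracks)
    ]
    kept_bins, features = [], []
    for t, row in enumerate(counts):
        filtered = [c if c >= n_d else 0 for c, n_d in zip(row, thresholds)]
        if any(filtered):
            kept_bins.append(t)
            features.append([math.log(c + pseudocount) for c in filtered])
    return thresholds, kept_bins, features
```

The returned features would then be clustered into |K| − 1 groups with any k-means implementation, and the per-cluster counts used to initialize the emission parameters.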

Data preprocessing

Benchmark I (K562 ENCODE) sequencing data was mapped to the hg20/hg38 (GRCh38) genome assembly (Human Genome Reference Consortium) using Bowtie 2.1.0 [36]. Samtools [41] was used to quality filter SAM files, whereby alignments with MAPQ smaller than 7 (-q 7) were skipped. To obtain midpoint positions of the ChIP-Seq fragments, the (single end) reads were shifted in the appropriate direction by half the average fragment length as estimated by strand coverage cross-correlation using the R/Bioconductor package chipseq [21]. Next, ChIP-Seq tracks were summarized by the number of fragment midpoints in consecutive bins of 200 bp width. The data for the 127 ENCODE and Roadmap Epigenomics cell types (benchmark II and III) was downloaded as preprocessed tagAlign files from the Roadmap Epigenomics supplementary website [12]. Fragment length was again estimated using the R/Bioconductor package chipseq and reads were shifted by the fragment half size to the average fragment midpoint [21]. The genome was partitioned into 200bp bins and reads were counted within each bin.
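The read shifting and binning described above can be illustrated with a short Python sketch. It assumes 0-based, half-open read coordinates given as `(start, end, strand)` tuples and a fragment length that has already been estimated (in the study this was done with the R/Bioconductor package chipseq); the function names are hypothetical.

```python
from collections import Counter


def fragment_midpoint(start, end, strand, fragment_length):
    """Shift the 5' end of a single-end read by half the fragment length."""
    half = fragment_length // 2
    if strand == "+":
        return start + half  # the 5' end of a plus-strand read is its start
    return end - half        # the 5' end of a minus-strand read is its end


def bin_counts(reads, fragment_length, bin_width=200):
    """Count fragment midpoints in consecutive bins of bin_width bp."""
    counts = Counter()
    for start, end, strand in reads:
        mid = fragment_midpoint(start, end, strand, fragment_length)
        if mid >= 0:  # ignore midpoints shifted off the chromosome start
            counts[mid // bin_width] += 1
    return counts


reads = [(100, 136, "+"), (350, 386, "-"), (190, 226, "+")]
binned = bin_counts(reads, fragment_length=150)
```

With a fragment length of 150 bp, the three toy reads have midpoints at positions 175, 311 and 265, so one midpoint falls into the first bin and two into the second.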

Model fitting of GenoSTAN

GenoSTAN was fitted on the complete data of benchmark data set I. The signal used for GenoSTAN model training on benchmark data sets II and III was extracted from ENCODE pilot regions (1% of the human genome analyzed in the ENCODE pilot phase [9]) for each cell type, which together amounted to 127% (benchmark II, 127 cell types) and 20% (benchmark III, 20 cell types) of the length of the human genome. The GenoSTAN-nb-20 model was learned in one day, the GenoSTAN-Poilog-20 model in two days using 10 cores. Model learning on benchmark set II using 10 cores took three (GenoSTAN-nb-127) and six days (GenoSTAN-Poilog-127). Precomputed library size factors were used to correct for variation in read coverage.

Model fitting of ChromHMM, Segway and EpicSeg

The data was binarized as described in [17] and ChromHMM was fitted with default parameters. Before applying Segway, the data was transformed using the inverse hyperbolic sine function [26] and a running mean over a 1 kb sliding window was computed to smooth the data. Segway was fitted on ENCODE pilot regions using a 200 bp resolution. EpicSeg was fitted on the untransformed count data with default parameters.
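The transformation applied before running Segway can be sketched as below. The sketch assumes the inverse hyperbolic sine (asinh) commonly used with Segway, 200 bp bins (so a 1 kb window spans five bins), a centred window that is truncated at the ends of the track, and transformation before smoothing. None of these details are specified in the text, and the function names are hypothetical.

```python
import math


def asinh_transform(counts):
    """Variance-stabilizing transform: asinh(x) = log(x + sqrt(x^2 + 1))."""
    return [math.asinh(c) for c in counts]


def running_mean(values, window=5):
    """Centred running mean; the window is truncated at the track ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


track = [0, 0, 12, 40, 35, 3, 0, 0]
signal = running_mean(asinh_transform(track), window=5)
```

Unlike a plain logarithm, asinh is defined at zero, so empty bins need no pseudocount.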

Processing of chromatin state annotations and external data

All state annotations and external data were lifted to the hg20/hg38 (GRCh38) genome assembly using the liftOver function from the R/Bioconductor package rtracklayer [37]. Overlap of state annotations with external data was calculated with GenomicRanges [38].

Computation of area under curve

AUC values were calculated on Benchmark set I for GenoSTAN, ChromHMM, Segway and EpicSeg. To this end, a segmentation was transformed into a binary classifier and evaluated as follows. Each 200 bp bin in the genome overlapping with HOT (TSS) regions was considered as ‘true condition’, the rest as ‘false’. For each state S the precision for recalling HOT (TSS) regions was calculated as the fraction of all segments annotated with S that overlapped with a HOT (TSS) region. States were then sorted by decreasing precision. The rank of each state was used as score in the prediction of HOT (TSS) regions on each 200 bp bin in the genome, which was then used to calculate AUC values.
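The conversion of a segmentation into a ranked classifier can be illustrated with the following Python sketch. It simplifies the procedure in one respect that should be kept in mind: precision is computed per bin rather than per segment. The function names and the toy data are hypothetical, and the AUC is obtained from the rank-sum (Mann-Whitney) formulation with ties between bins of the same state counted as one half.

```python
from collections import Counter


def state_precisions(states, truth):
    """Fraction of bins of each state that overlap a true region."""
    total = Counter(states)
    positive = Counter(s for s, y in zip(states, truth) if y)
    return {s: positive[s] / total[s] for s in total}


def auc_from_state_ranking(states, truth):
    """AUC when each bin is scored by the precision rank of its state."""
    precision = state_precisions(states, truth)
    # Increasing precision corresponds to increasing score.
    order = sorted(precision, key=precision.get)
    pos = Counter(s for s, y in zip(states, truth) if y)
    neg = Counter(s for s, y in zip(states, truth) if not y)
    n_pos = sum(pos.values())
    n_neg = sum(neg.values())
    neg_below = 0
    u = 0.0
    for s in order:
        # Positives of state s outrank all negatives of lower-ranked states
        # and tie with the negatives of the same state.
        u += pos[s] * (neg_below + 0.5 * neg[s])
        neg_below += neg[s]
    return u / (n_pos * n_neg)


states = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
truth = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
auc = auc_from_state_ranking(states, truth)
```

In this toy example state A has precision 2/3, state B 1/3 and state C 0, and the resulting AUC is 18/21.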

Analysis of transcription factor (co-)binding

Enrichment of TFs in chromatin states was calculated as described earlier [4]. The TF co-binding rate was calculated as the Jaccard index

$$
\frac{|S_{TF} \cap S_i|}{|S_{TF} \cup S_i|},
$$

where $S_{TF}$ is the set of TF binding sites and $S_i$ is the set of genomic regions that are annotated with state i.
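A minimal Python sketch of the Jaccard index is given below. It assumes that both the TF binding sites and the state annotation are represented as sets of 200 bp bin indices, which is one possible discretization and not necessarily the one used in the study; the function name is hypothetical.

```python
def jaccard_index(tf_bins, state_bins):
    """Size of the intersection divided by the size of the union."""
    tf_bins, state_bins = set(tf_bins), set(state_bins)
    union = tf_bins | state_bins
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(tf_bins & state_bins) / len(union)


tf_sites = {10, 11, 12, 40}
state_i = {11, 12, 13, 14}
rate = jaccard_index(tf_sites, state_i)
```

The index is symmetric and lies between 0 (no shared bins) and 1 (identical sets), so large and small states are compared on the same scale.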

Tissue-specific enrichment of disease- and complex trait-associated variants in regulatory regions

The GWAS catalog was obtained from the gwascat package from Bioconductor [21, 59]. Statistical testing was carried out in a similar manner as described in [12]. The enrichment of SNPs from individual genome-wide association studies was calculated for traits with at least 20 variants. SNPs for each trait were overlapped with promoter and enhancer regions and tested against the rest of the GWAS catalog as background using Fisher’s exact test. P-values were adjusted for multiple testing using the Benjamini-Hochberg correction. In order to calculate the recall and frequency of SNPs, promoter and enhancer states were randomly sampled until a genomic coverage of 2% for enhancers and 1% for promoters was reached. This was done to control for the fact that methods can differ from each other in the length of the promoters and enhancers they predict. This procedure was repeated 100 times, enabling the calculation of 95% confidence intervals.
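The statistical test can be sketched in Python using only the standard library. The sketch assumes a one-sided test for enrichment, since the direction of the test is not stated in the text. The 2 × 2 table is laid out as trait SNPs inside (a) and outside (b) the regulatory regions, and remaining catalog SNPs inside (c) and outside (d). Function names are hypothetical.

```python
import math


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Returns Pr(X >= a), where X is hypergeometric with the table margins fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = math.comb(row1 + row2, col1)
    numerator = 0
    for x in range(a, min(row1, col1) + 1):
        numerator += math.comb(row1, x) * math.comb(row2, col1 - x)
    return numerator / denominator


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_trait = fisher_exact_greater(3, 1, 1, 3)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Exact integer arithmetic via `math.comb` avoids rounding problems in the hypergeometric sum. For catalog-sized tables a log-scale implementation or an established statistics library would be the more practical choice.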

Acknowledgments

We thank Lars Steinmetz, Judith Zaugg, Aino Jarvelin, Wu Wei and Philip Brennecke for fruitful discussions and hosting BZ during the early phase of the project. BZ was supported by a DAAD short term research grant.

Author contributions

BZ, JG and AT developed the statistical methods and computational workflow of the study. BZ developed and implemented all software and scripts and carried out all computational analyses. BS helped with preprocessing of the K562 data set. MM and PC helped with interpretation of the biological results. BZ, JG and AT wrote the manuscript with input from all authors. All authors read and approved the final version of the manuscript.

Conflict of interest

The authors declare that they have no conflict of interest.

Figures

A

| Feature | GenoSTAN | ChromHMM | Segway | EpicSeg |
| --- | --- | --- | --- | --- |
| Count distribution | yes | no | no | yes |
| Library size correction | yes | no | no | no |
| Track-specific variance | yes | no | no | no |
| Running time (one cell line) | minutes-hours | minutes-hours | hours-days | minutes-hours |
| Manually entered parameters | Number of states | Number of states, binarization cutoff | Number of states, data transformation and smoothing | Number of states |

B

| Benchmark (ID) | Cell/tissue types | GenoSTAN models (ID) | Chromatin marks | Benchmarked models (ID) |
| --- | --- | --- | --- | --- |
| I | K562 (ENCODE) | Poilog-K562, nb-K562 | H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9ac, H3K27ac, H3K27me3, H4K20me1, P300, DNase-Seq | ChromHMM-ENCODE, ChromHMM-Nature, Segway-ENCODE, Segway-nmeth, Segway-Reg.Build, EpicSeg |
| II | 127 cell/tissue types (ENCODE, Roadmap Epigenomics) | Poilog-127, nb-127 | H3K4me1, H3K4me3, H3K36me3, H3K27me3, H3K9me3 | ChromHMM-15, ChromHMM-25 |
| III | 20 cell/tissue types (ENCODE, Roadmap Epigenomics) | Poilog-20, nb-20 | H3K4me1, H3K4me3, H3K36me3, H3K9ac, H3K27ac, H3K27me3, H3K9me3, DNase-Seq | ChromHMM-15, ChromHMM-18, ChromHMM-25 |

Figure 1: Overview of chromatin state annotation methods and study design. (A) Comparison of features of GenoSTAN against three previous chromatin state annotation algorithms. (B) Description of the three benchmark sets used in this study. GenoSTAN is benchmarked against published chromatin state annotations using ChromHMM (’ChromHMM-ENCODE’ [13, 27], ’ChromHMM-Nature’ [20], ’ChromHMM-15’, ’-18’ and ’-25’ [12]), Segway (’Segway-ENCODE’ [13, 27], ’Segway-nmeth’ [26] and ’Segway-Reg.Build’ [64]) and EpicSeg [43].

[Figure 2 image. Panel A: genome browser view of the TAL1 locus (genomic position 47.17 to 47.25 Mb) with the tracks H3K4me1, H3K4me3, DNase-Seq, TAL1, UCSC genes and GRO-cap TSS (+/− strand), and the segmentations GenoSTAN-PoiLog-K562, GenoSTAN-NB-K562, ChromHMM-ENCODE, Segway-ENCODE and EpicSeg; the TAL1 promoter, the TAL1 enhancer and the upstream and downstream enhancers are marked. Panel B: heatmap for the 18 GenoSTAN-Poilog-K562 states (Prom.11, PromWF.5, PromF.3, Enh.15, EnhWF.2, EnhF.10, ReprEnh.4, Gen5'.13, Elon'.14, Elon.6, ReprD.17, Repr.7, ReprW.9, Low.12, Low.8, Low.1, Low.16, Low.18). Columns: median read count (H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K27ac, H3K27me3, H3K36me3, H4K20me1, p300, DNase; color scale 0 to 150); number of segments (× 1000), median width (kb) and median distance to GENCODE TSS; recall of CpG island, stable TSS, instable TSS, TSS/5' UTR, exon, intron, 3' UTR and intergenic regions.]

Figure 2: Chromatin states fitted on Benchmark I using GenoSTAN. (A) GenoSTAN segmentations are shown with published segmentations using ChromHMM-ENCODE [13], Segway-ENCODE [13] and EpicSeg [43] at the TAL1 gene and three known enhancers. GenoSTAN-Poilog-K562 correctly recalls all known promoter and enhancer regions. GenoSTAN-nb-K562 misses the upstream enhancer. ChromHMM-ENCODE misclassifies most of the downstream enhancer region as promoter. (B) Median read coverage of GenoSTAN-Poilog-K562 chromatin states (left), their number of annotated segments in the genome, their median width and distance to the closest GENCODE TSS (middle). The right panel shows recall of genomic regions by chromatin states.

[Figure 3 image. Panel A (GRO-cap TSS): cumulative recall (TSS) against cumulative false discovery rate (TSS). Panel B (HOT regions): cumulative recall (HOT) against cumulative false discovery rate (HOT). Panel C: heatmap of the fraction of enhancer segments bound, by TF and segmentation. Panel D: recall of FANTOM5 enhancers against recall of FANTOM5 promoters for promoter and enhancer states. Panel E: enhancer activity for GenoSTAN-PoiLog states (Enh.15, Enh.2, Repressed, Low). Panel F: enhancer activity by segmentation. Segmentations shown: GenoSTAN-PoiLog-K562, GenoSTAN-NB-K562, ChromHMM-Nature, ChromHMM-ENCODE, Segway-ENCODE, Segway-nmeth, Segway-Reg.Build and EpicSeg.]

Figure 3: Comparison of GenoSTAN to other published segmentations on benchmark set I. (A) Performance of chromatin states in recovering GRO-cap transcription start sites. Cumulative FDR and recall are calculated by successively adding states (in order of increasing FDR). (B) The same as in (A) for ENCODE HOT regions. (C) The fraction of predicted enhancer segments bound by individual TFs is shown for different studies. GenoSTAN enhancers are more frequently bound by TFs than those from other studies. (D) Recall of FANTOM5 promoters and enhancers which are active in K562 (i.e. overlapping with a GRO-cap TSS and an ENCODE DNase hypersensitivity site) by predicted promoters and enhancers is plotted to assess how well models distinguish promoters from enhancers. (E) Predicted enhancers show significantly higher activity than repressed and low coverage regions as measured by a reporter assay (‘*’, ‘**’ and ‘***’ indicate p-values < 0.05, 0.01 and 0.001). (F) Comparison of experimental measures of enhancer activity between different studies.

[Figure 4 image. Panel A (FANTOM5 CAGE tags, 127 cell lines; models fitted on all 127 cell types with the 5 Roadmap Epigenomics core marks H3K4me1, H3K4me3, H3K36me3, H3K27me3, H3K9me3): cumulative recall (CAGE tags) against cumulative false discovery rate (CAGE tags). Panel B (GRO-cap TSSs; K562, Gm12878): cumulative recall (TSS) against cumulative false discovery rate (TSS). Panel C (HOT regions; K562, Gm12878, H1hesc, HeLa-S3, HepG2): cumulative recall (HOT) against cumulative false discovery rate (HOT). Panel D (20 cell lines): recall of FANTOM5 enhancers against recall of FANTOM5 promoters for promoter and enhancer states. Panel E: heatmap of the fraction of enhancer segments bound, by TF and segmentation. Segmentations shown: GenoSTAN-PoiLog-20, GenoSTAN-NB-20, GenoSTAN-PoiLog-127, GenoSTAN-NB-127, ChromHMM-15, ChromHMM-18 and ChromHMM-25.]

Figure 4: Comparison of GenoSTAN to other published segmentations on benchmark II and III. (A) Performance of chromatin states in recovering FANTOM5 CAGE tags in 127 cell types. CAGE tags were overlapped with chromatin states without the use of cell type information. Cumulative FDR and recall are calculated by successively adding states (in order of increasing FDR). (B) Performance of chromatin states in recovering GRO-cap transcription start sites in two cell types where GRO-cap data was available. (C) The same as in (B) for ENCODE HOT regions for five cell types where annotation of HOT regions was available. (D) Recall of FANTOM5 promoters and enhancers by predicted promoters and enhancers is plotted to assess how well models distinguish promoters from enhancers. (E) The fraction of predicted enhancer segments bound by individual TFs is shown for different studies. GenoSTAN enhancers are more frequently bound by TFs than those from other studies.

[Figure 5 image. Panel A: SNPs recalled by enhancer state (sensitivity) and SNP frequency per bp in enhancer state (precision) for each method at 2% genomic coverage. Panel B: SNPs recalled by promoter state (sensitivity) and SNP frequency per bp in promoter state (precision) for each method at 1% genomic coverage. Methods in A and B: GenoSTAN-PoiLog-20, GenoSTAN-NB-20, ChromHMM-25, ChromHMM-18, ChromHMM-15, GenoSTAN-PoiLog-127 and GenoSTAN-NB-127. Panel C: heatmap of -log10(p-value) (scale 0 to 30) with Roadmap Epigenomics cell types and tissues as rows, grouped as ES cell, iPSC, ES-derived, Blood & T cell, HSC & B cell, Mesenchyme, Neurosphere, Brain, Adipose, Heart, Digestive, Other and ENCODE 2012, and GWAS traits as columns (e.g. height, QT interval, type 1 diabetes, rheumatoid arthritis, multiple sclerosis, inflammatory bowel disease, LDL cholesterol, platelet counts).]

Figure 5: Enrichments of genetic variants associated with diverse traits in enhancers and promoters are specific to the relevant cell types or tissues. (A) Median SNP recall and frequency were calculated for enhancer states in different segmentations by restricting each segmentation to a total genomic coverage of 2% (100 samples of random subsetting) to control for the different numbers of enhancer calls between the segmentations. Error bars show the 95% confidence interval. (B) The same as in (A) but for promoters. (C) The heatmap shows the -log10(p-value) of significantly enriched traits in enhancer states (GenoSTAN-Poilog-127, p-value < 0.001, marked by ’*’). Only cell types and tissues where at least one trait was significantly enriched are shown. P-values were adjusted for multiple testing using the Benjamini-Hochberg correction.
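The coverage control described in (A) can be illustrated with a short sketch. This is not the authors' implementation: it assumes non-overlapping half-open (start, end) segments on a single sequence, integer SNP positions, and a 95% interval taken as the 2.5th and 97.5th percentiles over the resamples. The function names `subsample_to_coverage`, `recall_and_frequency` and `summarize` are hypothetical.

```python
import random
import statistics


def subsample_to_coverage(segments, target_bp, rng):
    """Draw segments in random order until their total length reaches target_bp."""
    shuffled = list(segments)
    rng.shuffle(shuffled)
    chosen, covered = [], 0
    for start, end in shuffled:
        if covered >= target_bp:
            break
        chosen.append((start, end))
        covered += end - start
    return chosen


def recall_and_frequency(segments, snps):
    """Recall: fraction of SNPs inside any segment. Frequency: SNP hits per covered bp."""
    hits = sum(any(start <= pos < end for start, end in segments) for pos in snps)
    covered = sum(end - start for start, end in segments)
    recall = hits / len(snps) if snps else 0.0
    frequency = hits / covered if covered else 0.0
    return recall, frequency


def summarize(segments, snps, genome_bp, coverage=0.02, n_samples=100, seed=0):
    """Median recall and frequency over random subsets, with a percentile interval."""
    rng = random.Random(seed)
    recalls, frequencies = [], []
    for _ in range(n_samples):
        subset = subsample_to_coverage(segments, coverage * genome_bp, rng)
        recall, frequency = recall_and_frequency(subset, snps)
        recalls.append(recall)
        frequencies.append(frequency)
    # With n=40, statistics.quantiles returns cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(recalls, n=40)
    return {
        "median_recall": statistics.median(recalls),
        "median_frequency": statistics.median(frequencies),
        "recall_ci95": (cuts[0], cuts[-1]),
    }
```

Fixing the covered fraction before computing recall keeps a segmentation with many enhancer calls from gaining sensitivity simply by covering more of the genome.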


[Figure 6 heatmaps. Left: TF co-binding; right: TF enrichment in the chromatin states Enh.15, EnhW.2, Prom.11, EnhF.10, PromF.3, PromW.5 and ReprEnh.4. Colour scale: enrichment (fraction) of TF (co-)binding, from 0 to 1. Transcription factors are grouped into the following modules.]

Promoter module 1 (strong co-binding): YY1, E2F2, ELF1, Egr1, ZBTB7A, CCNT2, HMGNR3, IRF1, KDM5B, PRF8, TAF1, TBP, RBBP5, UBF, ETS1, Sin3A, E2F4, HDAC1, SAP30, CHD1, SP1, CHD2, Mxi1
Promoter module 2 (weak co-binding): THAP1, SIX5, Nrf1, NFYB, NFYA, SP2, USF2, GTF2F1, GTF2FB, TAF7, Pol2-S2P, SRF, ELK1, BCLAF1, ZBTB33, SETDB1
CTCF/cohesin complex: Znf143, CTCF, Rad21, SMC3, CTCFL
Shared Promoter-Enhancer module: ZNF263, Myc, Max, Pol2, MAZ, BHLHE40, JunD, RCOR1, CBX3, PML, GABPA, ATF3, SPI1, USF1
AP-1 complex: Jun, JunB, FOSL1, FOS
Enhancer module 1 (strong co-binding): EP300, TEAD4, GATA2, TAL1, NR2F2, TRIM28, STAT5A, TBL1XR1, REST, ATF1, CEBPB, ARID3A, HDAC2
Enhancer module 2 (weak co-binding): COREST, GATA1, STAT2, STAT1, MEF2A, MafF, MafK, Bach1, NFE2, HDAC8, BCL3, ZNF274, SIRT6, SMARCA4, SMARCB1
Pol3 module: Pol3, BRF1, BDP1, POLR3A, GTF3C2, RFX5
Chromatin repression: HDAC6, RD, EZH2, BRF2, KAP1, NR2C2

Figure 6: Promoters and enhancers have a distinctive TF regulatory landscape. Co-binding (left) and enrichment of transcription factor binding sites (right) in chromatin states for 101 transcription factors in K562 reveal TF regulatory modules with distinct binding preferences for promoters, enhancers and repressed regions.


Supplementary Figures

[Supplementary Figure 1 heatmap, one row per chromatin state: Prom.22, Prom.16, PromWF.12, PromF.17, Enh.6, EnhTxn19, EnhWF.20, EnhF.18, ReprEnh.13, Gen5'.7, Elon.15, Elon.21, ReprD.4, ReprD.9, ReprW.3, Low.11, Low.5, Low.14, Low.8, Low.10, Low.2, Low.1, Low.23. Left panel (median read count, colour key 0 to 150): p300, DNase, H3K9ac, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K27me3, H3K36me3, H4K20me1. Middle panel: #segments (x 1000), median width (kb), median distance to GENCODE TSS. Right panel (recall of genomic regions): exon, intron, 3' UTR, intergenic, CpG island, stable TSS, instable TSS, TSS/5' UTR.]

Supplementary Figure 1: Median read coverage of GenoSTAN-nb-K562 chromatin states (left), their number of annotated segments in the genome, their median width and distance to the closest GENCODE TSS (middle). The right panel shows recall of genomic regions by chromatin states.
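The middle-panel summaries (number of segments, median width and median distance to the closest TSS per state) can be computed along the following lines. This is a minimal sketch under stated assumptions: segments are (start, end, state) tuples on one chromosome, at least one TSS position is given, and distance is measured from the segment midpoint, which the caption does not specify. The names `distance_to_nearest` and `state_summary` are hypothetical.

```python
import bisect
import statistics


def distance_to_nearest(position, sorted_sites):
    """Distance from position to the closest entry of a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_sites, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_sites[max(i - 1, 0): i + 1]
    return min(abs(position - site) for site in candidates)


def state_summary(segments, tss_positions):
    """Per state: segment count, median width, median midpoint-to-TSS distance."""
    sites = sorted(tss_positions)
    per_state = {}
    for start, end, state in segments:
        midpoint = (start + end) // 2  # assumption: distance is taken from the midpoint
        per_state.setdefault(state, []).append(
            (end - start, distance_to_nearest(midpoint, sites))
        )
    return {
        state: {
            "n_segments": len(values),
            "median_width": statistics.median(w for w, _ in values),
            "median_distance": statistics.median(d for _, d in values),
        }
        for state, values in per_state.items()
    }
```

Promoter-like states are expected to show small median distances, whereas enhancer-like and low-signal states lie further from annotated TSSs.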


[Supplementary Figure 2 plots. Panel A: heatmap of pairwise overlap (Jaccard index, colour key 0 to 0.8) with the cell values tabulated below as printed (0 to 100 scale). Rows and columns share the same order; P marks a promoter state annotation and E an enhancer state annotation. Panel B: distribution of the overlap between state annotations (Jaccard index, y-axis 0.0 to 1.0) for strong promoters and strong enhancers.]

   State annotation         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
 1 P GenoSTAN-NB-K562     100  56  59  74  71  73  72  73  24  16  12  10   2   0  45  10
 2 P Segway-nmeth          56 100  70  68  68  65  77  73   0  21  30  27  17  26  61  32
 3 P Segway-ENCODE         59  70 100  87  90  78  81  79  37   4  18  10   2   1   0   6
 4 P Segway-Reg.Build      74  68  87 100  88  83  81  82  42  13  31   0   4   4  37  14
 5 P GenoSTAN-Poilog-K562  71  68  90  88 100  79  80  79  43  15  31  27   3   0  38   0
 6 P ChromHMM-Nature       73  65  78  83  79 100  79  79  42  18   0  17   4   6  42  12
 7 P ChromHMM-ENCODE       72  77  81  81  80  79 100  79  53  25  36  35   0  19  60  33
 8 P EpicSeg               73  73  79  82  79  79  79 100  44   0  28  27   7  12  51  24
 9 E Segway-nmeth          24   0  37  42  43  42  53  44 100  43  54  50  40  41  43  33
10 E EpicSeg               16  21   4  13  15  18  25   0  43 100  42  42  40  39  40  30
11 E ChromHMM-Nature       12  30  18  31  31   0  36  28  54  42 100  76  61  57  49  48
12 E Segway-Reg.Build      10  27  10   0  27  17  35  27  50  42  76 100  62  63  53  57
13 E ChromHMM-ENCODE        2  17   2   4   3   4   0   7  40  40  61  62 100  68  43  45
14 E GenoSTAN-NB-K562       0  26   1   4   0   6  19  12  41  39  57  63  68 100  55  60
15 E Segway-ENCODE         45  61   0  37  38  42  60  51  43  40  49  53  43  55 100  50
16 E GenoSTAN-Poilog-K562  10  32   6  14   0  12  33  24  33  30  48  57  45  60  50 100

Supplementary Figure 2: (A) Heatmap of pairwise overlap (Jaccard index) of promoter (red) and enhancer (orange) state annotations from different studies on benchmark I. Rows and columns were ordered by separate clustering of promoter and enhancer overlaps. (B) Distribution of pairwise Jaccard indices for strong promoters and enhancers (off-diagonal elements of promoter and enhancer sub-matrices from (A)).
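For two state annotations A and B, the Jaccard index is |A ∩ B| / |A ∪ B|. The sketch below computes it on covered base pairs, which is an assumption (the overlap could equally be counted per segment); intervals are half-open (start, end) pairs on one chromosome, and the helper names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bp(intervals):
    """Total length of a merged interval list."""
    return sum(end - start for start, end in intervals)


def intersection_bp(a, b):
    """Overlap length of two merged, sorted interval lists (two-pointer sweep)."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Base-pair Jaccard index of two interval sets; 0.0 if both are empty."""
    a, b = merge(a), merge(b)
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0
```

An index of 1 means two annotations cover exactly the same positions, and 0 means they are disjoint.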


[Supplementary Figure 3 plots. Panels: (A) TSS (18 state models), (B) TSS (23 state models), (C) HOT regions (18 state models), (D) HOT regions (23 state models). x-axis: cumulative false discovery rate (TSS or HOT), 0.0 to 1.0; y-axis: cumulative recall (TSS or HOT), 0.0 to 1.0. Methods: GenoSTAN-PoiLog, GenoSTAN-NB, ChromHMM, ChromHMM-eq, Segway, EpicSeg. Labelled states in the TSS panels: PromWF.5, Enh.6, Enh.15, Prom.16, Prom.11, Prom.22; in the HOT panels: ReprEnh.13, ReprEnh.4, Enh.6, Prom.11, Prom.22, Enh.15.]

Supplementary Figure 3: Comparison of GenoSTAN to other methods using 18 and 23 states on the same data set in K562 (Benchmark I). (A-B) Performance of chromatin states in recovering GRO-cap transcription start sites for the 18 and 23 state models. Cumulative FDR and recall are calculated by successively adding states (sorted by decreasing precision). (C-D) The same as in (A-B) for ENCODE HOT regions.
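The cumulative curves can be sketched as follows. The counting scheme is an assumption made for illustration: each state is summarized by the number of segments called, the number of those overlapping a reference feature (true positives), and the number of reference features it recalls, with every reference feature attributed to at most one state. The function name `cumulative_curve` and the toy counts in the usage note are hypothetical.

```python
def cumulative_curve(states, n_reference):
    """Cumulative FDR and recall when states are added by decreasing precision.

    states maps a state name to (n_called, n_true_positive, n_reference_recalled);
    n_called must be positive. Returns a list of (name, fdr, recall) tuples.
    """
    ranked = sorted(
        states.items(),
        key=lambda item: item[1][1] / item[1][0],  # precision = TP / called
        reverse=True,
    )
    curve = []
    called = true_pos = recalled = 0
    for name, (n_called, n_tp, n_recalled) in ranked:
        called += n_called
        true_pos += n_tp
        recalled += n_recalled
        curve.append((name, 1 - true_pos / called, recalled / n_reference))
    return curve
```

For example, with toy counts `{"Prom": (100, 90, 60), "Enh": (200, 100, 30), "Low": (1000, 50, 10)}` and 100 reference sites, the promoter state enters first, and each added state raises recall at the cost of a higher cumulative FDR.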


[Supplementary Figure 4 plots. x-axis in all panels: number of states (10 to 30). y-axes: (A) fraction of promoter state overlapping with TSSs; (B) fraction of "best" state overlapping with HOT regions; (C) AUC (TSS); (D) AUC (HOT). Methods: GenoSTAN-PoiLog, GenoSTAN-NB, ChromHMM, ChromHMM-eq, Segway, EpicSeg.]

Supplementary Figure 4: Comparison of chromatin segmentation algorithms with respect to their ability to call GRO-cap transcription start sites (left panels) and ENCODE HOT regions (right panels), as a function of the state number used in the respective algorithm (x-axes). All models were learned on the data set of benchmark I. (A-B) For each model, the state with highest precision in recalling TSS (A) and HOT (B) regions is shown. (C-D) For each model, an area under curve (AUC) score (see Methods) is plotted to assess the spatial accuracy of a genome segmentation.


[Supplementary Figure 5 heatmaps, one row per chromatin state. Panel A states: Prom.19, Prom.5, Enh.12, EnhW.9, Elon.4, Gen5'.16, ElonW.15, Elon.13, ElonW.3, ReprEnh.2, Repr.14, Low.6, Low.18, Low.1, Low.10, Low.7, Low.17, Low.8, Low.11, Low.20. Panel B states: Prom.1, Prom.19, Enh.6, EnhW.8, EnhWF.9, Gen5'.16, ElonW.13, Elon.7, ElonW.15, ReprEnh.11, Repr.2, Repr.14, ReprW.10, Low.3, Low.17, Low.12, Low.18, Low.20, Low.5, Low.4. Left panel (median #reads per segment, colour key 0 to 20): Input, H3K4me1, H3K4me3, H3K9me3, H3K36me3, H3K27me3. Middle panel: #segments (x 1000), median width (kb), median distance to GENCODE TSS (kb). Right panel (recall of genomic regions): exon, intron, 3' UTR, intergenic, CpG island, TSS, TSS (stable), TSS/5' UTR.]

Supplementary Figure 5: GenoSTAN models for benchmark II. (A) Median read coverage of GenoSTAN-Poilog-127 (Benchmark set II, fitted on the 127 ENCODE and Roadmap Epigenomics cell types and tissues) chromatin states (left), their number of annotated segments in the genome, their median width and distance to the closest GENCODE TSSs of segments (middle). The right panel shows recall of genomic regions by chromatin states. (B) The same as (A) for GenoSTAN-nb-127.


[Supplementary Figure 6 heatmaps, one row per chromatin state. Panel A states: Prom.15, Prom.6, PromWF.4, Enh.9, EnhF.13, EnhWF.10, DNaseF.12, Gen5'.7, Elon.18, Elon.14, ElonW.22, ReprD.8, Repr.2, ReprW.11, ReprW.16, Low.21, Low.20, Low.1, Low.17, Low.24, Low.19, Low.3, Low.5, Low.23, Low.25. Panel B states: Prom.14, Prom.21, PromWF.15, Enh.9, EnhF.12, EnhWF.5, DNase.20, Gen5'.4, ElonW.16, Elon.19, ElonW.18, ReprD.23, ReprD.6, ReprW.3, Low.2, Low.13, Low.8, Low.10, Low.22, Low.7, Low.11, Low.1, Low.17, Low.24, Low.25. Left panel (median #reads per segment, colour key 0 to 55): Input, DNase, H3K9ac, H3K27ac, H3K4me1, H3K4me3, H3K9me3, H3K36me3, H3K27me3. Middle panel: #segments (x 1000), median width (kb), median distance to GENCODE TSS (kb). Right panel (recall of genomic regions): exon, intron, 3' UTR, intergenic, CpG island, TSS, TSS (stable), TSS/5' UTR.]

Supplementary Figure 6: GenoSTAN models for benchmark III. (A) Median read coverage of GenoSTAN-Poilog-20 (Benchmark set III, fitted on the 20 ENCODE and Roadmap Epigenomics cell types and tissues) chromatin states (left), their number of annotated segments in the genome, their median width and distance to the closest GENCODE TSSs of segments (middle). The right panel shows recall of genomic regions by chromatin states. (B) The same as (A) for GenoSTAN-nb-20.


[Supplementary Figure 7 genome browser view, genomic position 47.17 mb to 47.24 mb. Tracks: H3K4me1, H3K4me3, DNase-Seq, TAL1, UCSC genes, GRO-cap TSSs (+ and - strand), GenoSTAN (Poilog-127, nb-127, Poilog-20, nb-20), ChromHMM-25, ChromHMM-18, ChromHMM-15. Legend: Promoter, Enhancer. Callouts mark the TAL1 promoter and the upstream and downstream TAL1 enhancers.]

Supplementary Figure 7: Chromatin state annotations for benchmarks II and III. GenoSTAN segmentations are shown alongside the three Roadmap Epigenomics ChromHMM segmentations with 15 (ChromHMM-15), 18 (ChromHMM-18) and 25 (ChromHMM-25) states. Shown is the TAL1 locus with segmentations and data from the K562 cell line.


[Supplementary Figure 8 plots. Panel A (enhancer state) and panel B (promoter state): x-axis p-value (0 to 0.05); y-axis #traits enriched in at least one cell/tissue type. Curves in A: GenoSTAN-Poilog-127 (Enh.12), GenoSTAN-nb-127 (Enh.6), ChromHMM-15 (7_Enh), ChromHMM-25 (13_EnhA1). Curves in B: GenoSTAN-Poilog-127 (Prom.5), GenoSTAN-nb-127 (Prom.19), ChromHMM-15 (1_TssA), ChromHMM-25 (1_TssA). Panel C: heatmap with one row per cell type or tissue (Roadmap Epigenomics identifiers E001 to E129) and one column per trait; colour key -log10(p-value), 0 to 10; significant cells are marked by '*'. Tissue group legend: IMR90, ES cell, iPSC, ES-derived, Blood & T cell, HSC & B cell, Mesenchyme, Neurosphere, Brain, Heart, Adipose, Digestive, Other, ENCODE 2012. Traits: Height, Autism, HIV-1 control, Platelet counts, Type 2 diabetes, LDL cholesterol, IgG glycosylation, Multiple sclerosis, Adiponectin levels, Cholesterol (total), Obesity-related traits, Alzheimer's disease, Blood metabolite levels, Inflammatory bowel disease, Lipid metabolism phenotypes, Systemic lupus erythematosus, Chronic lymphocytic leukemia, Alzheimer's disease biomarkers, Age-related macular degeneration, Idiopathic membranous nephropathy.]

Supplementary Figure 8: Enrichments of genetic variants associated with diverse traits in enhancers and promoters are specific to the relevant cell types. (A) The number of traits which are enriched in enhancer states in at least one cell type or tissue is plotted for p-values < 0.05. (B) The same as in (A) but for promoters. (C) The heatmap shows the -log10(p-value) of significantly enriched traits in promoter states (GenoSTAN-Poilog-127, p-value < 0.05, marked by ’*’). P-values were adjusted for multiple testing using the Benjamini-Hochberg correction.
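The Benjamini-Hochberg adjustment mentioned in the caption follows the standard step-up procedure: the i-th smallest of m p-values is multiplied by m/i, and the results are made monotone from the largest p-value downwards. The sketch below is a plain-Python illustration of that standard procedure, not the authors' code; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A trait and cell type pair would then be marked in the heatmap when its adjusted p-value falls below the stated threshold.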

[Supplementary Figure 9 plots. Panels A and C: number of enhancers (y-axis, 0 to 50,000); panels B and D: number of promoters (y-axis, 6,000 to 18,000). Panels A and B group samples by tissue group: ES-derived, Brain, Mesenchyme, iPSC, Digestive, Myostatin, Muscle, IMR90, Heart, Smooth Muscle, Other, HSC & B cell, ES cell, ENCODE 2012, Adipose, Blood & T cell, Epithelial, Neurosphere, Thymus. Panels C and D group samples by sample type: ES cell derived, Primary Tissue, Cell line, Primary cultures, Primary cells.]

Supplementary Figure 9: Dependency of number of predicted promoters and enhancers on tissue group and sample type. (A) Number of enhancer states per Roadmap Epigenomics cell/tissue group. (B) The same as in (A) for promoters. (C) Number of enhancer states per Roadmap Epigenomics sample type. (D) The same as in (C) for promoters.

Supplementary Tables

Benchmark I - K562 (one cell type)

Method/segmentation     #promoters                    #enhancers
GenoSTAN-Poilog-K562    11,358 (Prom.11)              10,932 (Enh.15)
GenoSTAN-nb-K562        12,829 (Prom.22)              18,551 (Enh.6)
ChromHMM-Nature         16,118 (1_Active_Promoter)    30,492 (4_Strong_Enhancer)
ChromHMM-ENCODE         16,452 (Tss)                  22,323 (Enh)
Segway-ENCODE           19,894 (Tss)                  33,518 (Enh1)
Segway-nmeth            25,812 (8)                    80,043 (0)
Segway-Reg.Build        13,668 (7_tss)                38,992 (11_proximal)
EpicSeg                 16,192 (2)                    53,982 (3)

Benchmark II & III - 127 cell types and tissues

Method/segmentation     #promoters                    #enhancers
GenoSTAN-Poilog-127     15,229 (Prom.5)               45,955 (Enh.12)
GenoSTAN-nb-127         13,547 (Prom.19)              32,280 (Enh.6)
GenoSTAN-Poilog-20      12,710 (Prom.15)              19,730 (Enh.9)
GenoSTAN-nb-20          14,168 (Prom.14)              15,655 (Enh.9)
ChromHMM-15             21,002 (1_TssA)               92,824 (7_Enh)
ChromHMM-18             20,049 (1_TssA)               22,678 (9_EnhA1)
ChromHMM-25             12,525 (1_TssA)               12,706 (13_EnhA1)

Supplementary Table 1: Number of promoter and enhancer states for the chromatin state annotations analyzed in this study. The original state name is given in brackets.


Benchmark I - K562 (one cell type)

Method/segmentation     promoter states                        enhancer states
GenoSTAN-Poilog-K562    Prom.11, PromW.5                       Enh.15, Enh.2
GenoSTAN-nb-K562        Prom.16, Prom.22                       Enh.6, Enh.19
ChromHMM-Nature         1_Active_Promoter, 2_Weak_Promoter     4_Strong_Enhancer, 5_Strong_Enhancer
ChromHMM-ENCODE         Tss, TssF                              Enh, EnhW
Segway-ENCODE           Tss, PromF                             Enh1, Enh2
Segway-nmeth            8, 6                                   0, 13
Segway-Reg.Build        7_tss, 0_proximal                      1_proximal, 11_proximal
EpicSeg                 2                                      3

Benchmark II & III - 127 cell types and tissues

Method/segmentation     promoter states                        enhancer states
GenoSTAN-Poilog-127     Prom.19, Prom.5                        Enh.12, EnhW.9
GenoSTAN-nb-127         Prom.1, Prom.19                        Enh.6, EnhW.8
GenoSTAN-Poilog-20      Prom.15, Prom.6                        Enh.9, EnhF.13
GenoSTAN-nb-20          Prom.14, Prom.21                       Enh.9, EnhF.12
ChromHMM-15             1_TssA, 2_TssAFlnk                     7_Enh, 6_EnhG
ChromHMM-18             1_TssA, 2_TssFlnk                      9_EnhA1, 10_EnhA2
ChromHMM-25             1_TssA, 2_PromU                        13_EnhA1, 14_EnhA2

Supplementary Table 2: Promoter and enhancer states used to calculate recall of FANTOM5 promoters and enhancers. Two promoter and two enhancer states were used for each segmentation, except for the EpicSeg segmentation, which fitted only one enhancer state.