Live Exploration and Global Analysis of Large Epigenomic Datasets Konstantin Halachev1, Hannah Bast2, Felipe Albrecht1, Thomas Lengauer1 and Christoph Bock1,3,4

Halachev et al. Genome Biology 2012, 13:R96 http://genomebiology.com/2012/13/10/R96 SOFTWARE Open Access EpiExplorer: live exploration and global analysis of large epigenomic datasets Konstantin Halachev1*, Hannah Bast2, Felipe Albrecht1, Thomas Lengauer1 and Christoph Bock1,3,4* Abstract Epigenome mapping consortia are generating resources of tremendous value for studying epigenetic regulation. To maximize their utility and impact, new tools are needed that facilitate interactive analysis of epigenome datasets. Here we describe EpiExplorer, a web tool for exploring genome and epigenome data on a genomic scale. We demonstrate EpiExplorer’s utility by describing a hypothesis-generating analysis of DNA hydroxymethylation in relation to public reference maps of the human epigenome. All EpiExplorer analyses are performed dynamically within seconds, using an efficient and versatile text indexing scheme that we introduce to bioinformatics. EpiExplorer is available at http://epiexplorer.mpi-inf.mpg.de. Rationale [11], Ensembl [12] and the WashU Human Epigenome Understanding gene regulation is an important goal in bio- Browser [13] lies in their intuitive interface, which allows medical research. Historically, much of what we know users to browse through the genome by representing it as about regulatory mechanisms has been discovered by a one-dimensional map with various annotation tracks. mechanism-focused studies on a small set of model genes This approach is powerful for visualizing individual gene [1,2]. High-throughput genomic mapping technologies loci, but the key concept of genomics - investigating have recently emerged as a complementary approach [3]; many genomic regions in concert - tends to get lost and large-scale community projects are now generating when working with genome browsers only. Therefore, comprehensive maps of genetic and epigenetic regulation complementary tools are needed that handle the com- for the human and mouse genomes [4-7]. Substantial plexity of large genomic datasets while maintaining the potential for discovery lies in better connecting mechan- interactive and user-friendly character of genome ism-focused studies to the wealth of functional genomics browsers. and epigenomics data that are being generated. A handful Existing tools do not fully address this need. For exam- of pilot studies highlight the value of combining high- ple, the UCSC Table Browser [14] and Ensembl BioMarts throughput and mechanism-focused research (for example, [15] provide user-friendly support for selecting and in [8-10]), but few research groups are equally proficient in downloading sets of genomic regions, but the analysis of bioinformatics, large-scale genomics and in-depth func- the downloaded data needs to be performed locally using tional analysis to conduct highly integrated studies of gene command-line tools, including BEDTools [16] and R/Bio- regulation. A new generation of software tools could conductor [17]. Workflow tools such as Galaxy [18], bridge this gap by enabling user-friendly navigation and Taverna [19] and the Genomic HyperBrowser [20] com- analysis of large genomic databases. bine user-friendliness and flexibility, but they require Genome browsers are currently the only software tools careful planning and tend to be too slow for performing for navigating through genome data that are widely used, truly interactive and exploratory analyses. Finally, enrich- not only by bioinformaticians but also by biomedical ment analysis servers such as GREAT [21] and Epi- researchers with little computational background. The GRAPH [22] are powerful tools for identifying significant strength of web tools such as the UCSC Genome Browser associations in large biological datasets, but they lack the flexibility to explore the observed enrichments in a * Correspondence: [email protected]; [email protected] dynamic and interactive fashion. 1Max Planck Institute for Informatics, Campus E1.4, 66123 Saarbrücken, With EpiExplorer, we have developed a web server that Germany Full list of author information is available at the end of the article combines the interactive nature of genome browsers with © 2012 Halachev et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Halachev et al. Genome Biology 2012, 13:R96 Page 2 of 14 http://genomebiology.com/2012/13/10/R96 the region-based analytical approach of Galaxy, enabling sets of genomic regions and to use them with the same users to casually explore large-scale genomic datasets in flexibility as any of EpiExplorer’s default region sets. search of interesting functional associations. EpiExplorer We validated the utility of EpiExplorer by studying the does not aim to replace any existing tool; instead it facili- genome and epigenome characteristics of CpG islands, tates dynamic integration with tools such as the UCSC which is a well-understood topic [26]. As outlined in a Genome Browser, Galaxy and the Genomic HyperBrow- case study (see Text S1 and Figure S1 in Additional file 1) ser. Neither does EpiExplorer restrict the user as to how and its corresponding online tutorial on the supplemen- to search for relevant associations in the data - as enrich- tary website [27], EpiExplorer makes it easy to rediscover ment analysis tools do with their stringent statistical fra- the distinctive epigenetic characteristics of CpG islands, mework. Instead, EpiExplorer’s key strength lies in which have previously been studied using computational supporting exploratory hypothesis generation using a and experimental methods [28-31]. The entire analysis can broad range of genomic analyses performed in real time be performed in less than ten minutes without any bioin- over the Internet. Such exploratory analyses often pro- formatic training, guided by EpiExplorer’s context-specific vide a first indication of relevant associations that are visualizations. worth following up by in-depth statistical analysis using other software tools or by experimental validation in the Connecting a new epigenetic mark to large-scale wet lab. reference maps of the human epigenome To assess the utility of EpiExplorer for exploratory analysis Software and applications and hypothesis generation in a more advanced setting, we A method and software for genome-wide exploration investigated a recently discovered epigenetic mark. 5- and live analysis of large epigenomic datasets Hydroxymethylcytosine (5hmC) is a chemical variant of The EpiExplorer web server provides an interactive gate- normal (that is, non-hydroxylated) cytosine methylation. It way for exploring large-scale reference maps of the human was first observed in embryonic stem (ES) cells and in cer- and mouse genome. EpiExplorer is built around default tain types of neurons [32,33]. The conversion of cytosine and user-uploaded genomic region sets, which are sup- methylation into 5hmC is catalyzed by proteins of the plied as BED files. Before uploading data for EpiExplorer TET family. One TET protein (TET2) is frequently analysis, it is often useful to preprocess raw data with mutated in myeloid cancers [34], underlining the biomedi- application-specific tools. For example, ChIP-seq data may cal relevance of studying the role of 5hmC in gene be preprocessed with Cistrome [23] in order to derive a regulation. list of high-confidence peaks for the transcription factor or From the paper of Szulwach et al. [35], we obtained the epigenetic mark of interest. Similarly, RNA-seq data may genomic region coordinates for a total of 82,221 hotspots be preprocessed using Galaxy [18] in order to identify of 5hmC that the authors experimentally mapped in genomic regions that are differentially transcribed between human ES cells. We uploaded these hotspot regions into two cell types. EpiExplorer, where they are automatically annotated with Once the most meaningful BED file representation of default genomic attributes such as gene annotations and the dataset of interest has been obtained, this list of geno- associated epigenetic marks. EpiExplorer’s initial overview mic regions can be uploaded into EpiExplorer and interac- screen summarizes the overlap of 5hmC hotspots with the tively explored for hypothesis generation and visual most relevant genomic attributes and provides the starting analysis. The uploaded genomic regions are internally point for interactive exploration of the dataset (Figure 1a). annotated with a wide range of genomic attributes, which This view is tissue-specific, and we select a human ES cell enables visualization, analysis and filtering in real time. line (’H1hESC’) as the tissue type of interest. In ES cells, Five types of genomic regions are available in EpiExplorer we observe striking overlap between 5hmC hotspots and by default, namely CpG islands, gene promoters, transcrip- epigenetic marks associated with distal gene-regulatory tion start sites, predicted enhancer elements and a map of activity. Specifically, more than 80% of 5hmC hotspots 5-kb tiling regions spanning the entire genome. Further- overlap with peaks of the histone H3K4me1 mark, which more, EpiExplorer’s default genomic attribute database is a well-known signature of enhancer elements [36]. In includes chromatin and transcription factor binding data contrast, less than 20% of 5hmC hotspots overlap with his- from the ENCODE

Live Exploration and Global Analysis of Large Epigenomic Datasets Konstantin Halachev1, Hannah Bast2, Felipe Albrecht1, Thomas Lengauer1 and Christoph Bock1,3,4

Trna GENES AFFECT MULTIPLE ASPECTS of LOCAL CHROMOSOME BEHAVIOR

UC San Diego Electronic Theses and Dissertations

Inhibition of Casein Kinase 2 Blocks G2/M Transition in Early Embryo Mitosis

Gene Loci Associated with Insulin Secretion in Islets from Non- Diabetic Mice

Implication of a New Function of Human Tdnas in Chromatin Organization Yuki Iwasaki1,2, Toshimichi Ikemura1, Ken Kurokawa2 & Norihiro Okada1,3*

The Jnks Differentially Regulate RNA Polymerase III Transcription by Coordinately Modulating the Expression of All TFIIIB Subunits

A Single-Stranded Promoter for RNA Polymerase III

Original Article Variants in SNAI1, AMDHD1 and CUBN in Vitamin D

Gtrnadb 2.0: an Expanded Database of Transfer RNA Genes Identified in Complete and Draft Genomes

CK2-Mediated Phosphorylation of General Repressor Maf1 Triggers RNA Polymerase III Activation

Casein Kinase 2 Associates with Initiation-Competent RNA Polymerase I and Has Multiple Roles in Ribosomal DNA Transcription Tatiana B

Genome-Wide Localization of the RNA Polymerase III Transcription Machinery in Human Cells

Live Exploration and Global Analysis of Large Epigenomic Datasets Konstantin Halachev1*, Hannah Bast2, Felipe Albrecht1, Thomas Lengauer1 and Christoph Bock1,3,4*

Live Exploration and Global Analysis of Large Epigenomic Datasets Konstantin Halachev1, Hannah Bast2, Felipe Albrecht1, Thomas Lengauer1 and Christoph Bock1,3,4