ChIP-seq Annotation and Visualization How to add biological meaning to peaks

M. Defrance, C. Herrmann, D. Puthier, M. Thomas-Chollier, S Le Gras, J van Helden Our data in the context

Custom track uploded by the user (here ESR1 peaks in siGATA3 context)

public UCSC annotation/data tracks

Typical questions

- What are the associated to the peaks? ChIP-seq peaks - Are some genomic categories over-represented? - Are some functional categories over-represented? - Are the peaks close to the TSS, …? ChIP-seq peaks

Annotation Visualisation

Enrichment profiles Annotated peaks Genomic & functional

Average Profile near TSS Average Profile near TTS Annotation

Genomic location Relation to CpG island 0.30 1.0 2500

0.8 chr start end

0.25 chr15 65294195 65295186

0.6 chrX 19635923 19638359 Chst7 Average Profile Average Profile Average 3000

chr8 33993863 33995559 1500 0.4

0.20 chr10 114236977 114239326 Trhde # of regions # of regions

0.2 Distribution of Peak Heights chrX 69515082 69516482 Gabre 500 chr4 49857142 49858913 Grin3a 1000 −3000 −2000 −1000 0 1000 2000 3000 −3000 −2000 −1000 0 1000 2000 3000 0 5 10 15 chr16 7352861 7353410 Rbfox1 0 0 Relative Distance to TSS (bp) Relative Distance to TTS (bp) chr7 64764156 64765421 Gabra5 ChIP Regions (Peaks) over Average Gene Profile chrX 83436881 83438330 Nr0b1 CGI Shore Distant chr10 120288598 120289143 Msrb3 Multiple 1 Promoter Intergenic 2 chr5 67446361 67446855 Limch1 Gene Body 3 0.8 4 5 6 7 0.6 8 9 Average Profile Average 10 0.4 11 12 13 0.2 14

15 −1000 0 1000 2000 3000 4000 16 Upstream (bp), 3000 bp of Meta−gene, Downstream (bp) 17 18

19 Average Concatenated Exon Profile Average Concatenated Intron Profile X Y

0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08

1.0 Chromosome Size (bp) 1.0 0.8 0.8 0.6 0.6 Average Profile Average Profile Average 0.4 0.4 0.2 0.2

0 20 40 60 80 100 0 20 40 60 80 100

Relative Location (%) Relative Location (%) Distribution of Peak Heights

0 5 10 15

ChIP Regions (Peaks) over Chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 Chromosome W.Huang et al. 13 Genomic location Relation to CpG island 14

15 (Huang et al., 2011). The default format of input peak data files 2500

16 is the UCSC BED format. PAVIS also supports the GFF3

17 format, and can use peak data files from most ChIP-seq data

18 analysis tools. PAVIS offers two user interfaces for query data: 3000

clear and intuitive. The clear interface is the default interface, and 1500 19 is more concise and clear but offers fewer options. The intuitive X

interface affords a more detailed explanation of each query field, # of regions # of regions Y and provides additional query options such as sequence conser- 500 1000 vation annotation and the output data file format (e.g. MS Excel Downloaded from 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 format). A query peak file, the file to be annotated in genetic 0 0 feature context,Chromosome is loaded Size (bp) through either the clear or intuitive interface. As an option, this query peak file can be compared Fig. 1. A pie chart example from PAVIS annotation report with up to five other peak files. CGI Shore Distant Multiple http://bioinformatics.oxfordjournals.org/ Promoter Intergenic

2.2 Peak annotation Gene Body Annotation of query peaks includes identification of the closest gene to each peak and its relative location: upstream of transcrip- Average Profile near TSS Average Profile near TTS tion start site (TSS), intron, exon, 50/30-untranslated region or downstream of transcription termination site (TTS). When there are multiple nearby genes for a peak, the peak is associated with the genes whose TSS is the closest to the peak if the peak is within the gene region, if not, the peak is associated with the 0.30 1.0 genes with the closest TSS unless the peak is closer to the TTS of another gene. As a part of the annotation, PAVIS can also Fig. 2. Illustration of PAVIS visualization explorer. On the top is the compute the sequence conservation score of each peak region. 0.8 navigation control panel with the legend indicating colors for query and PAVIS annotation report provides summary statistics such as comparison peaks. In the middle is the density overview bar of all query at Université Libre de Bruxelles - Bibliothèques CP 180 on September 29, 2014 the number of annotated peaks and relative enrichment0.25 level in peaks on the chromosome, where the red line segment indicates the dis-

0.6 each genomic feature category (see Supplementary Material for played peaks location. At the bottom are two display windows showing

Average Profile Average the enrichment test). The report also provides download Profile Average links of query peaks #177 and #178 and their comparison peaks, respectively annotated data and displays the relative proportion of peaks in 0.4 each category (Fig. 1). 0.20 by assigning a unique color to each peak dataset. The query

0.2 2.3 Peak visualization peak is always assigned green and other color assignment PAVIS’ visualization interface ‘Visual Locus2 Explorer’ is shown by the legend on the control panel. PAVIS offers options −3000 −2000 launched−1000 directly0 from1000 the2000 PAVIS3000 annotation report. The−3000 inter- −2000in the−1000 control panel0 to1000 change2000 the chromosome3000 and the range face consists of three components: the control panel, the peak interactively, to use non-overlapping style view of peaks and to Relative Distance to TSS (bp) 1 Relative Distance to TTS (bp) density view panel and the peak visualization panelbits consisting of navigate peaks in either the chromosome-wide or within-region multiple display windows (Fig. 2, Supplementary Figs. 2–4).AAmode. A G C TG ATT T A G A GG AT C C T CCA CC Using peak-oriented visualization, PAVIS can simultaneously0 T TT

Average Gene Profile1 2 3 4 5 6 7 8 9 display multiple peaks with their relevant genomic5′ context in 3′ weblogo.berkeley.edu their respective windows. The query peak is automatically cen- ACKNOWLEDGEMENTS tered in its respective display window, which can be zoomed in or The authors thank Xiaojiang Xu at NIEHS Integrative out independent of other display windows. Bioinformatics for helping with software testing, and PAVIS graphically displays peak-relevant genomic context

0.8 Shuangshuang Dai and the NIEHS computational biology facil- including genes, exons, intron, TSS, TTS and transcription ity for their help with the setup of the PAVIS Web site and direction of genes in flanking genomic regions. The zooming support for computational infrastructure. function enables users to get a compact view of genomic context

0.6 in the large surrounding region or to take a detailed close look at Funding: Intramural Research Program of the NIH, the National the nearby region of a peak. Each display window has its own Institute of Environmental Health Sciences (ES101765). reset button that can quickly bring back the display region to the Average Profile Average default 20 kb region after zooming in/out operation. PAVIS also Conflict of Interest: none declared. 0.4 supports the mouseover function to display relevant information such as gene names, peak ID and peak width. Furthermore, PAVIS supports the integration of the UCSC browser, allowing REFERENCES

0.2 viewing the genomic region of the current window in the UCSC Huang,W. et al. (2011) Efficiently identifying genome-wide changes with next- browser. generation sequencing data. Nucleic Acids Res., 39, e130. −1000 By default,0 PAVIS shows the first1000 50 peaks (all peaks if5200050) in Ji,H. et al. (2008)3000 An integrated software system4000 for analyzing ChIP-chip and ChIP- the selected chromosome, and displays peaks in overlapping style seq data. Nat. Biotechnol., 26, 1293–1300. Upstream (bp), 3000 bp of Meta−gene, Downstream (bp)

Average3098 Concatenated Exon Profile Average Concatenated Intron Profile 1.0 1.0 0.8 0.8 0.6 0.6 Average Profile Average Profile Average 0.4 0.4 0.2 0.2

0 20 40 60 80 100 0 20 40 60 80 100

Relative Location (%) Relative Location (%) ChIP-seq peaks (bed, xls, txt file)

MACS peaks in bed format chr1 3001827 3002328 MACS_peak_1 55.28 chr1 3067471 3067948 MACS_peak_2 50.67 Statistical significance chr1 3660316 3662844 MACS_peak_3 352.43 chr1 3842462 3842994 MACS_peak_4 59.21 -10 log(P-value) chr1 3877254 3877710 MACS_peak_5 52.72 chr1 3939314 3939679 MACS_peak_6 82.99

MACS peaks extented format Chr Start End W Summit Tags Sig Fold FDR chr16 35981451 35981951 321 35981701 24 1107.07 30.55 0.0 chr18 30784846 30785346 628 30785096 40 964.91 43.62 0.0 chr14 79381873 79382373 441 79382123 29 939.17 37.2 0.0 chr12 34467249 34467749 1160 34467499 53 928.38 19.93 0.0 chr8 90304944 90305444 1804 90305194 80 883.76 10.21 0.0 chr15 65294343 65294843 992 65294593 62 824.32 13.4 0.0 chr17 48499365 48499865 370 48499615 24 798.58 20.62 0.0 chr18 72429446 72429946 531 72429696 31 790.48 39.77 10.0 chr15 54579253 54579753 487 54579503 29 781.63 32.15 9.09 chr13 56988583 56989083 916 56988833 60 777.7 9.44 8.33 ChIP-seq profiles (wig, wig.gz, bigWig)

wig generated by MACS track type=wiggle_0 name="ChIP-H3K4-1_treat_all" description="Extended tag pileup from MACS version 1.4.1 for every 40 bp" variableStep chrom=chr1 span=40 3000361 2 3000401 2 3000441 2 3000481 4 3000521 4 3000561 2 3000601 2 3000641 2 3001841 5 3001881 5 3001921 7 3001961 9 3002001 9 3002041 6 3002081 6 3002121 4 bigWig (converted from wig or bam) indexed binary format Profile around the TSS Peak distance to TSS distribution using profile in wig using peaks in bed

Average Profile near TSS Average Profile near TTS 0.30 1.0 0.8 0.25 0.6 Average Profile Average Profile Average 0.4 0.20 0.2

−3000 −2000 −1000 0 1000 2000 3000 −3000 −2000 −1000 0 1000 2000 3000

Relative Distance to TSS (bp) Relative Distance to TTS (bp)

Average Gene Profile 0.8 0.6 Average Profile Average 0.4 0.2

−1000 0 1000 2000 3000 4000

Upstream (bp), 3000 bp of Meta−gene, Downstream (bp)

Average Concatenated Exon Profile Average Concatenated Intron Profile 1.0 1.0 0.8 0.8 0.6 0.6 Average Profile Average Profile Average 0.4 0.4 0.2 0.2

0 20 40 60 80 100 0 20 40 60 80 100

Relative Location (%) Relative Location (%) Average Profile near TSS Average Profile near TTS 2.5 0.8

2.0 Profile upstream and downstream TSS 0.7 1.5 0.6 Average Profile Average Profile Average 1.0 0.5 0.5 0.4 −3000 −2000 −1000 0 1000 2000 3000 −3000 −2000 −1000 0 1000 2000 3000

PromoterRelative Distance to TSS (bp) Gene Relative Distance to TTS (bp)

Average Gene Profile 2.0 1.5 Average Profile Average 1.0 0.5

−1000 0 1000 2000 3000 4000

Upstream (bp), 3000 bp of Meta−gene, Downstream (bp)

Average Concatenated Exon Profile Average Concatenated Intron Profile 2.5 2.5 2.0 2.0 1.5 1.5 Average Profile Average Profile Average 1.0 1.0 0.5 0.5

0 20 40 60 80 100 0 20 40 60 80 100

Relative Location (%) Relative Location (%) OUTPUT: peakdistancetoTSSdistribution(densityplot) INPUT: bedfilewithpeaks Galaxy: MakeTSSdist Proportion of genes with a peak Proportion of genes with a peak at a given distance (density) at a given distance (cumulative)

0.0e+00 6.0e−07 1.2e−06 0.0 0.2 0.4 0.6 − 0 10 − 1 8 − 2 6 Distance fromTSS(Kb) Distance fromTSS(Kb) − 3 4 − 4 2 5 0 ChIP 6 2 7 4 ChIP 6 8 8 9 10 10 Practice Galaxy: AnnotatePeaks Practice

INPUT: bed file with peaks OUTPUT: annotated peaks + distribution per category 0.30 0.15 0.20 0.10 0.10 Proportion of peaks Proportion of peaks 0.05 0.00 0.00 GeneDown. Enh. Imm.Down. Interg. Intrag. Prom. exon f_exon f_intron intron jonction Vol. 29 no. 23 2013, pages 3097–3099 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btt520

Sequence analysis Advance Access publication September 4, 2013 PAVIS: a tool for Peak Annotation and Visualization 1, 2, , 1, ,§ 2 PAVIS Weichun Huang y, Rasiah Loganantharaj y z, Bryce Schroeder y , David Fargo and Leping Li1,* 1Biostatistics Branch and 2the Integrative Bioinformatics Group, National Institute of Environmental Health Sciences, Downloaded from Durham, NC 27709, USA Annotation and visualisation Associate Editor: Martin Bishop

ABSTRACT DNA element are often dependent on its position relative to

Summary: We introduce a web-based tool, Peak Annotation and nearby genes or other functional elements. It can be a challenging http://bioinformatics.oxfordjournals.org/ Visualization (PAVIS), for annotating and visualizing ChIP-seq peak and time-consuming task to examine all peaks, and to develop data. PAVIS is designed with non-bioinformaticians in mind and pre- meaningful biological interpretations of their functional rele- sents a straightforward user interface to facilitate biological interpret- vance. Motivated by this challenge, we developed the PAVIS ation of ChIP-seq peak or other genomic enrichment data. PAVIS, tool to facilitate data comparison, interpretation and hypothesis through association with annotation, provides relevant genomic con- generation from ChIP-seq peak data. text for each peak, such as peak location relative to genomic features There are several existing software tools that can be used for

including transcription start site, intron, exon or 50/30-untranslated annotation and visualization of ChIP-seq peaks, e.g. CisGenome region. PAVIS reports the relative enrichment P-values of peaks in (Ji et al., 2008) and PeakAnalyzer/PeakAnnotator (Salmon- these functionally distinct categories, and provides a summary plot Divon et al., 2010). All of these tools are useful for ChIP-seq of the relative proportion of peaks in each category. PAVIS, unlike peak data analysis and interpretation, but they each have limi- many other resources, provides a peak-oriented annotation and visu- tations. PeakAnalyzer/PeakAnnotator is a highly efficient tool

alization system, allowing dynamic visualization of tens to hundreds of with utilities not only for functional annotation of peaks, but at Université Libre de Bruxelles - Bibliothèques CP 180 on September 29, 2014 loci from one or more ChIP-seq experiments, simultaneously. PAVIS also for sequence extraction supporting motif analysis. enables rapid, and easy examination and cross-comparison of the However, it does not provide peak visualization capability. genomic context and potential functions of the underlying genomic CisGenome provides capabilities similar to PeakAnalyzer. elements, thus supporting downstream hypothesis generation. CisGenome, however, does not support the annotation of Availability and Implementation: PAVIS is publicly accessed at many existing ChIP-seq peak files including many of those http://manticore.niehs.nih.gov/pavis. from the ENCODE project (Thomas et al., 2007). Both the Contact: [email protected] UCSC genome browser (Kent et al., 2002) and the Integrative Supplementary information: Supplementary data are available at Genomics Viewer (Thorvaldsdottir et al., 2012) are powerful Bioinformatics online tools and have more capabilities, displaying multiple relevant data in different tracks. Their visualization capabilities, however, Received on April 11, 2013; revised on August 30, 2013; accepted on are designed to display maximal information for a single genomic September 2, 2013 region one at a time. As a complimentary tool, PAVIS is de- signed to display multiple loci simultaneously with each locus linked to the UCSC browser for additional information. 1 INTRODUCTION Furthermore, PAVIS can be used to annotate ChIP-seq peaks The advance of sequencing technology and its exponential in the context of nearby gene features. declining cost have made it possible to conduct many genome- wide studies to understand complex biological processes such as gene regulation, epigenomic changes and chromatin remodeling. 2 FEATURES AND METHODS These studies have generated a large number of ChIP-seq data- PAVIS is designed and developed for ease of use for biologists or sets that are subsequently parsed by peak calling programs bench scientists. PAVIS’ back-end engine was written in Python (Salmon-Divon et al., 2010) to report statistically significant for ease of maintenance and its web-based interface and visual- peaks. For a genome-wide study, the number of significant ization explorer were implemented with HTML5 and Ajax peaks can be tens to hundreds of thousands. The biological rele- JavaScript codes. PAVIS currently supports genomic annotation vance of a ChIP-seq peak and the functions of its underlying of several species including human, mouse, rat, worm and yeast. All genomic annotation data and alignment data were from the *To whom correspondence should be addressed. UCSC genome browser. PAVIS offers two primary functions: yThe authors wish it to be known that, in their opinion, the first three peak data annotation and peak visualization within relevant authors should be regarded as joint First Authors. genomic context. zPresent address: The Center for Advanced Computer Studies, University of Louisiana at Lafayette, LA 70504, USA PAVIS takes as the input the peak location data generated §Present address: School of Medicine, the Stony Brook University, NY by peak-calling tools, e.g. MACS (Zhang et al., 2008), or 11794, USA other more general ChIP-seq data analysis tool, e.g. EpiCenter

http://manticore.niehs.nih.gov:8080/pavis/Published by Oxford University Press 2013. This work is written by US Government employees and is in the public domain in the US. 3097 PAVIS

Output Example PAVIS

Detailed view

Chromosome Loci Start Loci End Gene ID Gene Symbol Strand Distance to TSS chr13 022690027 022690527 NM_000231 SGCG + +37218 chr13 023047991 023048491 NM_148957 TNFRSF19 + +5733 chr13 023359572 023360072 NM_005932 MIPEP - +1765 chr13 023634753 023635253 NR_031753 MIR2276 + +0449 chr13 024956993 024957493 NM_016529 ATP8A2 + +113035 chr13 025197768 025198268 NM_016529 ATP8A2 + +353810 chr13 025317576 025318076 NM_016529 ATP8A2 + +473618 PAVIS Optional practice

INPUT: peaks OUTPUT: annotated peaks + figures

Chromosome Loci Start Loci End Gene ID Gene Symbol Strand Distance to TSS chr13 022690027 022690527 NM_000231 SGCG + +37218 chr13 023047991 023048491 NM_148957 TNFRSF19 + +5733 chr13 023359572 023360072 NM_005932 MIPEP - +1765 chr13 023634753 023635253 NR_031753 MIR2276 + +0449 chr13 024956993 024957493 NM_016529 ATP8A2 + +113035 chr13 025197768 025198268 NM_016529 ATP8A2 + +353810 chr13 025317576 025318076 NM_016529 ATP8A2 + +473618 Nucleic Acids Research Advance Access published May 5, 2014 Nucleic Acids Research, 2014 1 doi: 10.1093/nar/gku365 deepTools: a flexible platform for exploring deepTools deep-sequencing data

1, 1,2, 1 3 Fidel Ram´ırez †,FriederikeDundar¨ †, Sarah Diehl ,Bjorn¨ A. Gruning¨ and Thomas Manke1,*

1Max Planck Institute of Immunobiology and Epigenetics, Stubeweg¨ 51, 79108 Freiburg, Germany, 2Faculty of Biology, University of Freiburg, Schanzlestraße¨ 1, 79104 Freiburg, Germany and 3Department of Computer Science, University of Freiburg, Georges-Kohler-Allee¨ 106, 79110 Freiburg, Germany

Received February 4, 2014; Revised April 5, 2014; Accepted April 15, 2014 Downloaded from

ABSTRACT preting NGS sequencing data. To add to the burden, re- searchers are routinely asked to compare their novel ex- We present a Galaxy based web server for process- perimental results with sequencing data deposited in pub- http://nar.oxfordjournals.org/ ing and visualizing deeply sequenced data. The web lic repositories like the Sequence Read Archive (SRA, http: server’s core functionality consists of a suite of //www.ncbi.nlm.nih.gov/sra) and the European Nucleotide newly developed tools, called deepTools, that enable Archive (ENA, www.ebi.ac.uk/ena). In the past years, many users with little bioinformatic background to explore programs have been developed for NGS data processing. the results of their sequencing experiments in a stan- The vast majority of these tools, however, require experience dardized setting. Users can upload pre-processed with the command-line and often do not provide graphi- files with continuous data in standard formats and cal outputs to guide the interpretation of the results. Ad- ditionally, NGS data may suffer from several biases that

generate heatmaps and summary plots in a straight- at Université Libre de Bruxelles - Bibliothèques CP 180 on October 2, 2014 forward, yet highly customizable manner. In addi- should be taken into consideration for down-stream anal- yses. In our experience, the lack of user-friendly software tion, we offer several tools for the analysis of files along with comprehensive documentation and explanations containing aligned reads and enable efficient and re- of the multiple steps frequently deter biologists from tak- producible generation of normalized coverage files. ing part in the processing and analysis of their own data. As a modular and open-source platform, deepTools To address these challenges, we have developed and refned can easily be expanded and customized to future de- a set of tools, called deepTools, that enable researchers to mands and developments. The deepTools webserver manage, manipulate and most importantly, explore their is freely available at http://deeptools.ie-freiburg.mpg. NGS data. Our tools are incorporated into the Galaxy de and is accompanied by extensive documentation framework, one of the most popular analysis platforms for and tutorials aimed at conveying the principles of NGS data, that offers easy and intuitive access to numer- deep-sequencing data analysis. The web server can ous bioinformatic applications and strongly supports doc- be used without registration. deepTools can be in- umentation and reproducibility of analysis steps (1). deep- Tools provides standardized diagnostic plots for aligned stalled locally either stand-alone or as part of Galaxy. reads, various normalization strategies, extensive support for format conversion and a set of tools for highly customiz- INTRODUCTION able meta-analyses and visualizations, such as heatmaps and summary plots. The utilities are easy-to-use as our web As high-throughput sequencing technologies (also: next- server handles the computational complexity and users will generation sequencing, NGS) continue to become cheaper, encounter the familiar Galaxy environment. The underly- faster and more reliable, they are being adapted to ad- ing software has been optimized for effciency and highly dress a wide spectrum of biological questions, ranging from parallelized processing, making the tools suitable for rou- transcriptome assessments (RNA-seq) to –DNA in- tine analysis of large-scale data. Apart from publication- teractions (ChIP-seq), epigenetic marks (ChIP-seq, BS- ready images, users can export standardized output fles seq) and the 3D-structure of the genome (4C, 5C, ChIA- that comply with the formats established by big sequenc- PET, Hi-C). This has led to a widespread adoption of the ing consortia (BAM, bigWig, bedGraph, BED). This en- technology in many laboratories that are now facing the sures compatibility with other Galaxy workfows and ex- formidable challenge of processing, analyzing and inter-

*To whom correspondence should be addressed. Tel: +49 0 761 5108 738; Fax: +49 0 761 5108 80738; Email: [email protected] †These authors contributed equally to the work.

C The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. ⃝ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] TSS deepTools: heatmapper Practice

INPUT: ChIP bigWig + bed of feature OUTPUT: heatmap UCSC Genes ANA LYSIS

GREAT improves functional interpretation of GREAT cis-regulatory regions

Cory Y McLean1, Dave Bristor1,2, Michael Hiller2, Shoa L Clarke3, Bruce T Schaar2, Craig B Lowe4, Aaron M Wenger1 & Gill Bejerano1,2 Functional annotation of cis-regulatory regions We developed the Genomic Regions Enrichment of Annotations genes—introduces a strong bias toward genes that are flanked by large Tool (GREAT) to analyze the functional significance of cis- intergenic regions12,13. For example, though the Gene Ontology14 regulatory regions identified by localized measurements of DNA (GO) term ‘multicellular organismal development’ is associated with binding events across an entire genome. Whereas previous 14% of human genes, the ‘nearest genes’ approach associates over methods took into account only binding proximal to genes, 33% of the genome with these genes. This biological bias results in GREAT is able to properly incorporate distal binding sites numerous false positive enrichments, particularly for the input set and control for false positives using a binomial test over the sizes typical of a ChIP-seq experiment (Fig. 2b and Supplementary input genomic regions. GREAT incorporates annotations from Fig. 1). Building on our experience in addressing these pitfalls12,15,16, 20 ontologies and is available as a web application. Applying we have developed a tool that robustly integrates distal binding events GREAT to data sets from chromatin immunoprecipitation while eliminating the bias that leads to false positive enrichments. coupled with massively parallel sequencing (ChIP-seq) of multiple transcription-associated factors, including SRF, RESULTS NRSF, GABP, Stat3 and p300 in different developmental Here we describe GREAT, which analyzes the functional significance of contexts, we recover many functions of these factors that are sets of cis-regulatory regions by explicitly modeling the vertebrate genome missed by existing gene-based tools, and we generate testable regulatory landscape and using many rich information sources. hypotheses. The utility of GREAT is not limited to ChIP-seq, ChIP-seq peaks as itOntology could also be applied to termsopen chromatin, localized A binomial test for long-range gene regulatory domains epigenomic markers and similar functional data sets, as well GREAT associates genomic regions with genes by defining a ‘regu- as comparative genomics sets. latory domain’ for each gene in the genome. Each genomic region is associated with all genes in whose regulatory domains it lies The coupling of chromatin immunoprecipitation with massively par- (Fig. 1b). High-throughput chromosomal conformation capture allel sequencing, ChIP-seq, is ushering in a new era of genome-wide (3C) approaches such as 5C (ref. 17), Hi-C (ref. 18) or enhanced GOfunctional Molecular analysis1–3. Thus far, computational Function efforts have focused ChIP-4C (ref. 19) are providing first glimpses of actual gene regula- on pinpointing the genomic locations of binding events from the tory domains. Because we still lack precise empirical maps, however, GOdeluge Biological of reads produced by deep Process sequencing4–8. Functional inter- GREAT assigns each gene a regulatory domain consisting of a basal pretation is then performed using gene-based tools developed in the domain that extends 5 kb upstream and 1 kb downstream from its wake of the preceding microarray revolution9–11. In a typical analysis, transcription start site (denoted below as 5+1 kb), and an extension Diseaseone compares the totalOntology fraction of genes annotated for a given ontol- up to the basal regulatory domain of the nearest upstream and down- ogy term with the fraction of annotated genes picked by proximal stream genes within 1 Mb (GREAT allows the user to modify the rule Pathwaybinding events to obtains a gene-based P value for enrichment (Fig. 1 and distances). GREAT further refines the regulatory domains of a Functional annotationand Online Methods). handful of genes, including several global control regions20, by using This procedure has a fundamental drawback: associating only pro- their experimentally determined regulatory domains. Our tool can …ximal binding events (for example, under 2–5 kb from the transcrip- also incorporate additional locus-based and genome-wide data as they of ChIP-peaks tion start site) typically discards over half of the observed binding become available (Supplementary Fig. 2 and Online Methods). events (Fig. 2a). However, the standard approach to capturing distal Given a set of input genomic regions and an ontology of gene events—associating each binding site with the one or two nearest annotations, GREAT computes ontology term enrichments using a binomial test that explicitly accounts for variability in gene regulatory domain size by measuring the total fraction of the genome annotated 1 2 Department of Computer Science, Department of Developmental Biology and for any given ontology term and counting how many input genomic 3Department of Genetics, Stanford University, Stanford, California, USA. 4Center for Biomolecular Science and Engineering, University of California regions fall into those areas (Fig. 1b and Online Methods). In the Santa Cruz, Santa Cruz, California, USA. Correspondence should be addressed to example above, GREAT expects 33% of all input elements to be asso- G.B. ([email protected]). ciated with ‘multicellular organismal development’ by chance, rather Published online 2 May 2010; doi:10.1038/nbt.1630 than the 14% of input elements that a gene-based test assumes. The

NATURE BIOTECHNOLOGY VOLUME 28 NUMBER 5 MAY 2010 495

how do we make biological sense

out of the peaks ? GREAT

Note: Only human ( hg19 and hg18), mouse (mm9) and zebrafish (danRer7) genomes are supported GREAT

A N A LYSIS GREAT a b c 10

SRF (H: Jurkat) NRSF (H: Jurkat) GABP (H: Jurkat) Stat3 (M: ESC) B H Binomial test over Hypergeometric Hypergeometric testgenomic over regions genestest over genes 8 p300 (M: ESC) p300 (M: limb) p300 (M: forebrain) p300 (M: midbrain) 0.7 60 b10:h3 b7:h1 b3:h2 value) H \ B P * * 0.6 6 h5 * genes50 with term A + b8:h4 h9 h8+ h7+ b9:h6* + h10 0.5 + 40 B \ H 0.4 4 30 b1 0.3 ×

20 –log(hypergeometric 2 b5 0.2 × b2 × ×

Fraction of all elements Fraction × b6 b4 0.1 10 False positive enriched terms positive False 0 0 0 0–2 2–5 5–50 50–500 > 500 0.1 0.5 1 5 10 50 100 0 2 4 6 8 10 Distance to nearest transcriptiongenes start site (kb) with peaks Input set size (thousands) –log(binomial P value) Figure 2 Binding profiles and their effects on statistical tests. (Genesa) ChIP-seq data sets → of several Regions regulatory show ←that the Peaks majority of binding events lie well outside the proximal promoter, both for sequence-specific transcription factors (SRF and NRSF, ref. 8; Stat3, ref. 43) and a general enhancer-associated protein (p300, refs. 33,43). Cell type is given in parentheses: H, human; M, mouse. (bBinomial) When not restricted test toover proximal regions promoters, the gene-based hypergeometric test (red) generates false positive enriched terms, especially at the size range of 1,000–50,000 input regions typical of a ChIP-seq set. Negligible false positive enrichment was observed for the region-based binomial test (blue). For each set size, we generated 1,000 random input sets in which each in the was equally likely to be included in each set, avoiding assembly gaps. We calculated all GO term enrichments for both hypergeometric and binomial tests using GREAT’s 5+1 kb basal promoter and up to 1 Mb extension association rule (see Results). Plotted is the average number of terms artificially significant at a threshold of 0.05 after application of the conservative Bonferroni correction. (c) GO enrichment P values using the genomic region-based binomial (x axis) and gene-based hypergeometric (y axis) tests on the SRF data8 with GREAT’s 5+1 kb basal promoter and upterm to 1 Mb A extension association rule (see Results). b1 through b10 denote the top ten most enriched terms when we used the binomial test. h1 throughterm h10 denote B the top ten most enriched terms when we used the hypergeometric test. Terms significant by both tests (B H) provide specific and accurate annotations supported by multiple genes and binding events (Table 3). Terms significant by only the hypergeometric test (H\B) are general and often associated with genes of large regulatory domains, whereas terms significant by only the binomial test (B\H) cluster four to six genomic regions near only one or two genes annotated with the term (Supplementary Table 46). Given that 60% of the genome is ChIP-seq and mapped to the genome using the quantitative enrich- shows that many genes proximal to SRF binding events (inp Jurkat > 0.5 cells) ment of sequence tags (QuEST) ChIP-seq peak-callingannotated tool 8to. This A, data would are alsoI randomly proximal to expectYY1 binding events (in HeLa cells), consistent set’s authors applied existing gene-based enrichment tools, which did with experiments showing that SRF acts in conjunction with YY1 to not discern specific functions of SRF from the3 or set moreof regions peaks it binds to8, fallregulate into Fos region (ref. 26). TheA ? top six hits in the Predicted Promoter Motifs and concluded that SRF is a regulator of basic cellular processes ontology27 are all variants of the SRF motif generated from different with no specific physiological roles (results reproduced in Table 2). experiments and thus serve as strong positive controls of our method. Although SRF is indeed a regulator of basic cellularGiven functions, that 15%numerous of theUsing genome the Pathway is Commons ontology28, GREAT predicts that SRF studies have implicated SRF in more specific biological contexts. SRF regulates components of the TRAIL signaling pathway and the class I is a key regulator of the Fos oncogene22 andannotated has also been todescribed B, would PI3K I signalingrandomly pathway. expect Previous experimental work has demonstratedp = 0.07 as a “master regulator of actin cytoskeleton”32 3or. Neither more FOS peaks nor actin to fallthat intothere isregion an association B ? between SRF and TRAIL signaling29 and appeared in the top ten hypotheses generated by the previous study that SRF is needed for PI3K-dependent cell proliferation30. (Table 2). The same was true when we used GREAT with only pro- In addition to rediscovering and expanding specific known func- ximal (2 kb) associations (Supplementary"GREAT Table improves 6). functional interpretationtions of SRF, of GREAT cis-regulatory produces regions" testable hypotheses even for this well- However, GREAT analysis of the mostMcLean significant et al. Nat. SRF Biotech. ChIP-seq (2010) studied transcription factor. The Transcription Factor Targets ontology peaks8 (QuEST score > 1; n = 556) using the default settings (5+1 kb indicates that SRF binds near genes regulated by E2F4 (in T98G, U2OS basal, up to 1 Mb extension) prominently highlights the key obser- and WI-38 cells; Table 3). SRF and E2F4 have not been shown to co- vation that gene-based analyses were unable to reveal: SRF regulates regulate target genes; however, both SRF and E2F4 are known to inter- genes associated with the actin cytoskeleton23 (Table 3). As postulated act with Smad3 (refs. 31,32), and they may thus be co-regulators of a above, using both binomial and hypergeometric enrichment tests does common set of genes. The Predicted Promoter Motifs ontology reveals highlight informative GO terms more effectively than using either additional potential cofactors and co-regulators. It is particularly test alone (Fig. 2c and Supplementary Table 46). Moreover, when useful given that many more genes have characterized binding motifs extension of regulatory domains is limited to 50 kb, one-third of the than have genome-wide ChIP data available. In this case, it shows enrich- supporting regions and associated genes are lost, and actin-related ment for SRF binding near genes containing GABP motifs in their pro- terms drop in rank (Supplementary Table 7). moters. Notably, an independent experiment measuring GABP-bound Coupling distal (up to 1 Mb) associations with the many additional regions of the genome in Jurkat cells has found that 29% of SRF peaks ontologies available within GREAT provides a wealth of enrichments occur within 100 bp of a GABP peak, suggesting that SRF and GABP for specific known functions of SRF. An enrichment analysis of TreeFam may indeed work together8. We were able to generate this same hypo- gene families24 shows that SRF binds in proximity to five of six mem- thesis using GREAT, without observing the GABP ChIP-seq data. bers of the FOS family. Two genes within the Fos family, Fos and Fosb, are previously known targets of SRF (ref. 22). The Transcription Factor P300 binding in the developing mouse limbs Targets ontology 25 has compiled data from ChIP experiments that link Second, we analyzed a recent ChIP-seq data set comprising 2,105 transcription factor regulators to downstream target genes. GREAT regions of the mouse genome bound by the general enhancer-associated

NATURE BIOTECHNOLOGY VOLUME 28 NUMBER 5 MAY 2010 497 GREAT Practice

INPUT: bed file with peaks OUTPUT: Enriched GO terms and functions

Giannopoulou and Elemento BMC Bioinformatics 2011, 12:277 http://www.biomedcentral.com/1471-2105/12/277

SOFTWARE Open Access An integrated ChIP-seq analysis platform with ChIPseeqer customizable workflows W.Huang et al. Eugenia G Giannopoulou1,2 and Olivier Elemento1,2*

Abstract A comprehensive framework for the analysisBackground: of ChIP-seqChromatin immunoprecipitation data followed by next generation sequencing (ChIP-seq), enables unbiased and genome-wide mapping of protein-DNA interactions and epigenetic marks. The first step in ChIP-seq (Huang et alW.Huang., 2011). et The al. default format of input peak data files data analysis involves the identification of peaks (i.e., genomic locations with high density of mapped sequence reads). The next step consists of interpreting the biological meaning of the peaks through their association with is the UCSC BED format. PAVIS also supports the GFF3 known genes, pathways, regulatory elements, and integration with other experiments. Although several programs have been published for the analysis of ChIP-seq data, they often focus on the peak detection step and are usually format, and can use peak data files from most ChIP-seq data not well suited for thorough, integrative analysis of the detected peaks. analysis tools.(Huang PAVISet al., offers 2011). two The default user interfaces format of inputfor query peak data data: files Results: To address the peak interpretation challenge, we have developed ChIPseeqer, an integrative, is the UCSC BED format. PAVIS also supports the GFF3 comprehensive, fast and user-friendly computational framework for in-depth analysis of ChIP-seq datasets. The clear and intuitive. The clear interface is the default interface, and novelty of our approach is the capability to combine several computational tools in order to create easily format, and can use peak data files from most ChIP-seq data customized workflows that can be adapted to the user’s needs and objectives. In this paper, we describe the main is more concise and clear but offers fewer options. The intuitive components of the ChIPseeqer framework, and also demonstrate the utility and diversity of the analyses offered, analysis tools. PAVIS offers two user interfaces for query data: by analyzing a published ChIP-seq dataset. interface affords a more detailed explanation of each query field, Conclusions: ChIPseeqer facilitates ChIP-seq data analysis by offering a flexible and powerful set of computational clear and intuitive. The clear interface is the default interface, and tools that can be used in combination with one another. The framework is freely available as a user-friendly GUI and providesis more additional concise query and clear options but offers such fewer as sequence options. The conser-intuitive application, but all programs are also executable from the command line, thus providing flexibility and automatability for advanced users. vation annotationinterface and affords the a output more detailed data file explanation format (e.g. of each MS query Excel field, Downloaded from format). Aand query provides peak additional file, the query file to options be annotated such as sequence in genetic conser- Background introduction of the technology [2,3,9,10]. Although peak The use of chromatin immunoprecipitation in combina- detection is important for the analysis of ChIP-seq data, it vation annotation and the output data file format (e.g. MS Excel Downloaded from feature context, is loaded through either the clear or intuitive tion with high-throughput sequencing (ChIP-seq) has is only the first step. Additional computational tools are format). A query peak file, the file to be annotated in genetic enabled the study of genome-wide mapping of protein- needed to help interpret the genome-wide transcription interface. As an option, this query peak file can be compared Fig. 1. A pie chart example from PAVISDNA interaction annotation and epigenetic report marks. By sequencing factor binding and histone mark enrichment patterns feature context, is loaded through either the clear or intuitive millions of immunoprecipitated DNA fragments in a sin- revealed by peak detection procedures. We have developed with up to five other peak files. gle experiment, ChIP-seq outperforms the array-based ChIPseeqer, a comprehensive computational framework interface. As an option, this query peak file can be compared Fig. 1. A pie chart example from PAVISChIP-chip annotation (Chromatin report Immunoprecipitation followed by that enables broad, but also in-depth, analysis of ChIP-seq with up to five other peak files. DNA microarray hybridization) technology in terms of data. The framework includes: (1) gene-level annotation of peaks, (2) pathways enrichment analysis, (3) regulatory

quality, specificity, and coverage [1-3], and has the poten- http://bioinformatics.oxfordjournals.org/ tial to greatly improve our understanding of the mechan- element analysis, using either a de novo approach, known

2.2 Peak annotation isms underlying transcriptional regulation [4-8]. Many or user-definedhttp://bioinformatics.oxfordjournals.org/ motifs, (4) nongenic peak annotation 2.2 Peak annotation peak detection methodologies and software tools have (repeats, CpG islands, duplications, published ChIP-seq Annotation of query peaks includes identification of the closest been developed for the analysis of ChIP-seq data since the datasets), (5) conservation analysis, (6) clustering analysis, Annotation of query peaks includes identification of the closest (7) visualization tools, (8) integration and comparison across different ChIP-seq experiments. These components gene to each peak and its relative location: upstream of transcrip- * Correspondence: [email protected] gene to each peak and its relative location: upstream of transcrip- 1HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for share a common architecture: they take as input a set of tion start site (TSS), intron, exon, 50/30-untranslated region or Computational Biomedicine, Weill Cornell Medical College, 1305 York ChIP-seq peaks, perform the defined analysis, and output tion start site (TSS), intron, exon, 50/30-untranslated region or Avenue, New York, NY, 10021, USA one or more sets of peaks that can be used by any of the downstreamdownstream of transcription of transcription termination termination site (TTS). site (TTS). When When there there Full list of author information is available at the end of the article ©2011GiannopoulouandElemento;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsofthe Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and are multipleare nearby multiple genes nearby for genes a peak, for a the peak, peak the ispeak associated is associated with with reproduction in any medium, provided the original work is properly cited. the genes whosethe genes TSS whose is the TSS closest is the closest to the to peak the peak if the if the peak peak is is within the genewithin region, the gene if region, not, the if not, peak the is peak associated is associated with with the the genes with thegenes closest with the TSS closest unless TSS the unless peak the is peak closer is closer to the to the TTS TTS of another gene. As a part of the annotation, PAVIS can also Fig. 2. Illustration of PAVIS visualization explorer. On the top is the of anothercompute gene. As the a sequence part of theconservation annotation, score PAVIS of each peak can also region. Fig.navigation 2. Illustration control panel of PAVIS with the legend visualization indicating explorer. colors for On query the and top is the compute the sequence conservation score of each peak region. at Université Libre de Bruxelles - Bibliothèques CP 180 on September 29, 2014 PAVIS annotation report provides summary statistics such as navigationcomparison control peaks. In panel the middle with isthe the legend density indicating overview bar colors of all forquery query and PAVIS annotationthe number report of annotated provides peaks summary and relative statistics enrichment such level as in comparisonpeaks on the peaks. chromosome, In the where middle the is red the line density segment overview indicates bar the dis- of all query at Université Libre de Bruxelles - Bibliothèques CP 180 on September 29, 2014 each genomic feature category (see Supplementary Material for the number of annotated peaks and relative enrichment level in peaksplayed on peaks the location. chromosome, At the bottom where theare two red display line segment windows indicates showing the dis- the enrichment test). The report also provides download links of query peaks #177 and #178 and their comparison peaks, respectively each genomic feature category (see Supplementary Material for played peaks location. At the bottom are two display windows showing annotated data and displays the relative proportion of peaks in the enrichment test). The report also provides download links of each category (Fig. 1). query peaks #177 and #178 and their comparison peaks, respectively annotated data and displays the relative proportion of peaks in by assigning a unique color to each peak dataset. The query each category2.3 (Fig. Peak 1). visualization peak is always assigned green and other color assignment PAVIS’ visualization interface ‘Visual Locus Explorer’ is byshown assigning by the legend a unique on the color control to panel. each PAVIS peak offers dataset. options The query 2.3 Peaklaunched visualization directly from the PAVIS annotation report. The inter- peakin the is control always panel assigned to change green the chromosome and other and color the range assignment face consists of three components: the control panel, the peak showninteractively, by the to legend use non-overlapping on the control style panel. view ofPAVIS peaks offers and to options PAVIS’ visualization interface ‘Visual Locus Explorer’ is navigate peaks in either the chromosome-wide or within-region density view panel and the peak visualization panel consisting of in the control panel to change the chromosome and the range launched directlymultiple from display the windows PAVIS annotation (Fig. 2, Supplementary report. The Figs. inter- 2–4). mode. face consistsUsing of three peak-oriented components: visualization, the control PAVIS panel, can simultaneously the peak interactively, to use non-overlapping style view of peaks and to density viewdisplay panel multiple and the peaks peak with visualization their relevant panel genomic consisting context of in navigate peaks in either the chromosome-wide or within-region multiple displaytheir respective windows windows. (Fig. 2, The Supplementary query peak is automatically Figs. 2–4). cen- mode.ACKNOWLEDGEMENTS tered in its respective display window, which can be zoomed in or Using peak-oriented visualization, PAVIS can simultaneously The authors thank Xiaojiang Xu at NIEHS Integrative out independent of other display windows. display multiple peaks with their relevant genomic context in Bioinformatics for helping with software testing, and PAVIS graphically displays peak-relevant genomic context Shuangshuang Dai and the NIEHS computational biology facil- their respectiveincluding windows. genes, The exons, query intron, peak TSS, is automatically TTS and transcription cen- ACKNOWLEDGEMENTS ity for their help with the setup of the PAVIS Web site and tered in its respectivedirection of display genes in window, flanking which genomic can regions. be zoomed The zooming in or Thesupport authors for computational thank Xiaojiang infrastructure. Xu at NIEHS Integrative out independentfunction of enables other displayusers to get windows. a compact view of genomic context Bioinformatics for helping with software testing, and PAVIS graphicallyin the large surrounding displays regionpeak-relevant or to take a genomic detailed close context look at Funding: Intramural Research Program of the NIH, the National Shuangshuang Dai and the NIEHS computational biology facil- including genes,the nearby exons, region intron, of a peak. TSS, Each TTS display and window transcription has its own Institute of Environmental Health Sciences (ES101765). ity for their help with the setup of the PAVIS Web site and direction ofreset genes button in that flanking can quickly genomic bring back regions. the display The regionzooming to the default 20 kb region after zooming in/out operation. PAVIS also supportConflict for of Interest computational: none declared. infrastructure. function enablessupports users the mouseoverto get a compact function toview display of genomic relevant information context in the largesuch surrounding as gene names, region peak or to ID take and a detailedpeak width. close Furthermore, look at Funding: Intramural Research Program of the NIH, the National the nearbyPAVIS region supports of a peak. the integration Each display of the window UCSC browser, has its allowing own InstituteREFERENCES of Environmental Health Sciences (ES101765). viewing the genomic region of the current window in the UCSC reset button that can quickly bring back the display region to the Huang,W. et al. (2011) Efficiently identifying genome-wide changes with next- browser. default 20 kb region after zooming in/out operation. PAVIS also Conflictgeneration of Interestsequencing: data. noneNucleic declared. Acids Res., 39, e130. supports the mouseoverBy default, PAVIS function shows to the display first 50 relevant peaks (all information peaks if550) in Ji,H. et al. (2008) An integrated software system for analyzing ChIP-chip and ChIP- the selected chromosome, and displays peaks in overlapping style seq data. Nat. Biotechnol., 26, 1293–1300. such as gene names, peak ID and peak width. Furthermore, PAVIS supports the integration of the UCSC browser, allowing REFERENCES viewing the3098 genomic region of the current window in the UCSC Huang,W. et al. (2011) Efficiently identifying genome-wide changes with next- browser. generation sequencing data. Nucleic Acids Res., 39, e130. By default, PAVIS shows the first 50 peaks (all peaks if550) in Ji,H. et al. (2008) An integrated software system for analyzing ChIP-chip and ChIP- the selected chromosome, and displays peaks in overlapping style seq data. Nat. Biotechnol., 26, 1293–1300.

3098 Average Profile near TSS Average Profile near TTS

CEAS (Cis-regulatory Element Annotation System) 0.30 1.0 0.8 0.25 0.6

http://liulab.dfci.harvard.edu/CEAS/ Profile Average Profile Average 0.4 0.20 0.2

−3000 −2000 −1000 0 1000 2000 3000 −3000 −2000 −1000 0 1000 2000 3000

Relative Distance to TSS (bp) Relative Distance to TTS (bp)

Average Gene Profile 0.8 0.6 Average Profile Average 0.4 0.2

−1000 0 1000 2000 3000 4000

Upstream (bp), 3000 bp of Meta−gene, Downstream (bp)

Average Concatenated Exon Profile Average Concatenated Intron Profile 1.0 1.0 0.8 0.8 0.6 0.6 Average Profile Average Profile Average 0.4 0.4 0.2 0.2

0 20 40 60 80 100 0 20 40 60 80 100

Relative Location (%) Relative Location (%) Molecular Cell Article

Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements HOMER Required for Macrophage and B Cell Identities Sven Heinz,1,7 Christopher Benner,1,7 Nathanael Spann,1,7 Eric Bertolino,4 Yin C. Lin,3 Peter Laslo,6 Jason X. Cheng,4 Cornelis Murre,3 Harinder Singh,4,5 and Christopher K. Glass1,2,* 1Department of Cellular and Molecular Medicine 2Department of Medicine 3Section of Molecular Biology University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA 4Molecular Genetics and Cell Biology, The University of Chicago, 929 East 57th Street GCIS W522, Chicago, IL 60637, USA Motif discovery and NGS data analysis5Department of Discovery Immunology, Genentech, San Francisco, CA 94080, USA 6Section of Experimental Hematology, University of Leeds, Leeds LS9 7TF, UK 7These authors contributed equally to this work *Correspondence: [email protected] DOI 10.1016/j.molcel.2010.05.004

SUMMARY cell type (Tronche and Yaniv, 1992). Comparisons of the genome-wide binding patterns of different transcription factors Genome-scale studies have revealed extensive, cell in a variety of species and cell types have generated two major type-specific colocalization of transcription factors, insights regarding transcription factor binding patterns: (1) but the mechanisms underlying this phenomenon different factors in the same cell type tend to colocalize on a remain poorly understood. Here, we demonstrate in genome-wide scale (Chen et al., 2008; MacArthur et al., 2009), macrophages and B cells that collaborative interac- and (2) the same factor in different cell types or at different stages tions of the common factor PU.1 with small sets of of development exhibits different genome-wide binding patterns (Lupien et al., 2008; Odom et al., 2004; Sandmann et al., 2006). macrophage- or B cell lineage-determining tran- Several mechanisms have been proposed to explain this scription factors establish cell-specific binding sites phenomenon, including protein-protein interactions that enable that are associated with the majority of promoter- ternary complex formation with DNA (Verger and Duterque- distal H3K4me1-marked genomic regions. PU.1 Coquillaud, 2002), pioneering factors that disrupt the closed nucle- binding initiates nucleosome remodeling, followed osome conformation and enable other factors to bind (Cirillo et al., by H3K4 monomethylation at large numbers of 2002), cooperative binding of one or more factors to clustered sites genomic regions associated with both broadly and that facilitates nucleosome displacement and stable binding of the specifically expressed genes. These locations serve factors involved (Boyes and Felsenfeld, 1996; Miller and Widom, as beacons for additional factors, exemplified by 2003), and binding to chromatin marked in a cell type-specific liver X receptors, which drive both cell-specific manner by lysine 4-methylated histone H3 (H3K4me1/2) (Lupien gene expression and signal-dependent responses. et al., 2008), a sign of open chromatin that is correlated with activity of nearby genes (Heintzman et al., 2009). However, how transcrip- Together with analyses of transcription factor tion factors gain access to their eventual binding sites and the binding and H3K4me1 patterns in other cell types, hierarchy of events that generate their cell type-specific binding these studies suggest that simple combinations of and the associated epigenetic modification patterns have not lineage-determining transcription factors can specify previously been elucidated on a genome-wide scale. the genomic sites ultimately responsible for both cell The mammalian hematopoietic system represents a well- identity and cell type-specific responses to diverse characterized model for the analysis of the combinatorial sets signaling inputs. of transcription factors that orchestrate the development of distinct cell types from hematopoietic stem cells. Within the hematopoietic system, macrophages and B cells play essential INTRODUCTION and complementary roles in the innate and adaptive arms of the immune system. Recent studies suggest a model in which The development of complex multicellular organisms involves these cell types are derived from a lymphoid-primed multipoten- hierarchically organized progenitor cells that ultimately give tial progenitor (LMPP) that subsequently gives rise to common rise to terminally differentiated cell types with specialized func- lymphoid progenitor (CLP) and granulocyte-macrophage pro- tions. Cell fates are specified by lineage-determining transcrip- genitor (GMP) cells (Adolfsson et al., 2005)(Figure 1A). The Ets tion factors whose expression is often not limited to a single factor PU.1 is required for the generation of both GMP and http://homer.salk.edu/homer/576 Molecular Cell 38, 576–589, May 28, 2010 ª2010 Elsevier Inc. HOMER: annotate peaks

1 Peak ID 2 Chromosome 3 Peak start position 4 Peak end position 5 Strand 6 Peak Score 7 FDR/Peak Focus Ratio/Region Size 8 Annotation (i.e. Exon, Intron, ...) 9 Detailed Annotation (Exon, Intron etc. + CpG Islands, repeats, etc.) 10 Distance to nearest RefSeq TSS 11 Nearest TSS: Native ID of annotation file 12 Nearest TSS: Gene ID 13 Nearest TSS: Unigene ID 14 Nearest TSS: RefSeq ID 15 Nearest TSS: Ensembl ID 16 Nearest TSS: Gene Symbol 17 Nearest TSS: Gene Aliases 18 Nearest TSS: Gene description 19 Additional columns depend on options selected when running the program. HOMER: compare peaks

Peak Co-Occurrence Statistics Co-Bound Peaks Differentially Bound Peaks REMAP Extensive regulatory catalogue to compare with

http://tagc.univ-mrs.fr/remap/ REMAP Practice Motifs Details in next session

ChIP-seq peaks

>mm9_chr1_39249116_39251316_+ gagaggaagggggagaaagagggagggggagGGTGATAGGTAGCCAGGAG CCAATGGGGGCGTTTTCCTTGTCCAGGCCACTTGCTGGAATGTGAGATGT AGAATGACCCAAAGAGAGCTGCCAAGACAGAGCTCTGCCCCAGGAATTGA DNA sequence ACTCAAAGGGTGTCAGAAAGCAGGTGGCCTTTGTGCACCTGGCGCGGGGA CGTGGCTCCCCTCTTCCGGCTGGTCTAGCCAGGtgcctgcctgcctgcct gccGTGATCTCTGGACGCCAGTAGAGGGTTGTTGTGGGTTTGGGTGAAAC ACGCCACCCCTGAGCTCTTCCGCGGGGCTAGCAATCTCCCCATCACCCCA TTCGCGCTCAGAACCCCCTCAGCGAGTCTAACAGCAGGCCTGGTTCCCCG

Discovered motif A [24 54 59 0 65 71 4 24 9 ] C [ 7 6 4 72 4 2 0 6 9 ] G [31 7 0 2 0 1 1 38 55 ] T [14 9 13 2 7 2 71 8 3 ]

2

1 bits Motif logo AA A G C TG ATT T A G A GG AT C C T CCA CC 0 T TT 5′ 1 2 3 4 5 6 7 8 9 3′ weblogo.berkeley.edu