Promoter Analysis & Set Enrichment

Steven H. Kleinstein

Department of Pathology Yale University School of Medicine [email protected]

May 6, 2010 Lecture & Lab Outline

• Promoter analysis

• Over-representation analysis

• Gene set enrichment analysis

Lab section by Uri Hershberg

IllustrateIllustrate somesome generalgeneral approachesapproaches andand conceptsconcepts Identifying regulators of TLR responses Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS) Each rowisagene

Time (hours)

HypothesizeHypothesize thatthat genesgenes withwith similarsimilar temporaltemporal kineticskinetics areare co-regulatedco-regulated andand thatthat theythey shareshare regulatorsregulators Identifying regulators of TLR responses Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

K-means clustering defined 11 groups of comprising regulated ‘waves’ of transcription

HypothesizeHypothesize thatthat clusteredclustered genesgenes areare co-regulatedco-regulated andand thatthat theythey shareshare cis-regulatorycis-regulatory elementselements Transcriptional regulation by promoters and enhancers General transcription factors (green ovals) bind to core promoter regions through recognition of common elements such as TATA boxes and initiators (INR)

(Farnham, Nature Reviews Genetics, 2009)

PromoterPromoter activity activity can can be be altere alteredd by by site-specific site-specific DNA-binding DNA-binding factors factors (red(red trapezoid)trapezoid) interactinginteracting withwith ciscis elements elements (dark(dark blueblue box)box) DNA Sequence Motifs for TF Binding Sites Short, recurring patterns in DNA with presumed biological function

Nature Biotechnology 24, 423 - 425 (2006)

Collection of binding sites (ROX1 )

Consensus sequence

Frequency Matrix

ForFor predictionprediction ofof newnew sites,sites, needneed toto accountaccount forfor conservationconservation Measuring Conservation in the Binding Site Information content measures conservation at each site

Measure of conservation ATG at each position i: ATC AAT AAA --- 210 Information content

Sequence Logo

Total information content related to probability of finding motif in ‘random’ DNA sequence

http://weblogo.berkeley.edu/http://weblogo.berkeley.edu/ The TRANSFAC Database Eukaryotic transcription factors and their genomic binding sites

TRANSFAC has public (older version) and commercial (more features) versions

Other (free) possibility:

CurrentCurrent versionversion containscontains 834834 matricesmatrices (601(601 vertebrate)vertebrate) The TRANSFAC Database Eukaryotic transcription factors and their genomic binding sites

MATCH Score

Information Vector (higher for conserved positions)

C C C C C C T T G Frequency of nucleotide bi to occur at the position G A i of the matrix (B{A, T, G, C}) A C C G G T T C C A A A A C C G G

AssumesAssumes positionspositions areare independentindependent Identifying putative TF binding sites Search by scanning the promoter region

MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2: e36.

ThresholdThreshold cancan bebe determineddetermined byby lookinglooking atat “random”“random” DNA DNA Identifying putative TF binding sites Integrative approaches improve predictions – active research area

(Hannenhalli, Bioinformatics, 2008)

‘Gene‘Gene Sets’Sets’ of of targettarget genesgenes forfor eacheach transcriptiontranscription factorfactor Focus on proximal promoter regions Common practice to consider 1-2Kb region around TSS

ChIP-chip data is mixed Predicted binding sites

~80% > 10Kb

Experimentally confirmed ~50% > 10Kb

(Ananko et al, BMC Bioinformatics, 2007) TSS (Hua et al, MSB, 2008)

RecentRecent genome-widegenome-wide datadata callscalls thisthis intointo questionquestion Focus on evolutionarily conserved regions

98% experimentally defined sequence-specific binding sites of skeletal-muscle- specific TFs confined to 19% of human sequences most conserved in rodent (Wasserman et al., Nat Genet. 2000)

Sequence identity >65% identifies 72% of the known TFBSs Evolutionary conservation excludes known sites (Sauer et al, Bioinformatics. 2006) 32-40% of functional human binding sites are not functional in rodents (Dermitzakis and Clark, Mol Biol Evol., 2002)

RequiringRequiring human–mouse–rat human–mouse–rat genomic genomic alignm alignmentsents provided provided a a 44-fold 44-fold increase increase in in thethe specificity specificity of of TRANSFAC TRANSFAC predictions predictions(Rat(Rat Genome Genome Sequencing Sequencing Project, Project, Nature, Nature, 2004) 2004) Variation in TF binding across individuals 6% of binding regions within 1 kb of transcription start sites (TSSs) of RefSeq genes differed significantly across individuals

ChIP-Seq Analysis SNPs in motif predict binding sites

Also correlated with Binding and expression match to consensus site are correlated

PolIIPolII binding binding between between humans humans and and chimpanzee chimpanzee suggests suggests extensive extensive divergence divergence Identifying Target Genes Scan 2kb up-stream of transcription start site

1. Extract genomic sequence (-2kb of TSS)

2. Scan conserved regions for potential binding sites using TRANSFAC binding matrices

3. Identify conserved sites (Human/Chimp/Mouse)

TF 1 TF 2 … TF M Gene 1   Table linking transcription factors Gene 2  and putative target genes …  Gene N 

‘Gene‘Gene Sets’Sets’ of of targettarget genesgenes forfor eacheach transcriptiontranscription factorfactor Gene Sets of Transcription Factor Targets Molecular Signatures Database at Broad Institute (http://www.broad.mit.edu/gsea/msigdb)

V$NRSF_01 (Neuron Restrictive Silencing Factor)

Genes with promoter regions [-2kb,2kb] around transcription start site containing the motif TTCAGCACCACGGACAGMGCC which matches annotation for REST: RE1-silencing transcription factor

ATP6V0A1 RPIP8 POU4F3 FLJ42486 L1CAM SLC17A6 TRIM9 MAPK11 DDX25 SNAP25 DRD3 FGF12 COL5A3 SYT4 BDNF POMC GABRB3 TMEM22 GRM1 HES1 MGAT5B TCF1 PCSK2 FLJ44674 VIP FLJ38377 ZNF335 GABRG2 LHX3 DNER CHKA NEFH ZNF579 CHAT SCAMP5 CDKN2B SST OGDHL KCNH4 SEZ6 GLRA1 HTR1A RPH3A PRG3 NPPB FGD2 RNF13 SYT6 CHGA SLC12A5 ELAVL3 KCNH8 GDAP1L1 HCN1 DRD2 HCN3 PAQR4 CALB1 BARHL1 SCN3B CRYBA2 TNRC4 VGF RASGRF1 NEF3 OMG KCNIP2 CDK5R1 ATP2B2 HTR5A PHYHIPL SARM1 GHSR INA PTPRN DBC1 CSPG3 CHRNB2 GRIN1 STMN2 POU4F2 APBB1 GLRA3

GeneGene setssets cancan alsoalso bebe defineddefined manuallymanually Which TFs are driving dynamics of each cluster? Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

LookLook for for TF TF targets targets that that are are ‘over-represented’ ‘over-represented’ in in a a cluster cluster Over-Representation Analysis If you draw n marbles at random, what is probability of k green ones?

Total Marbles Green Marbles (N) (K)

Pick (n)

Green Marbles (k) Adapted from Can (John) Bruce

K NK    knk HypergeometricHypergeometric Distribution: Distribution: Pk(|, nK , N )   N  ProbabilityProbability ofof kk greengreen ifif nn isis randomrandom samplesample n Over-Representation Analysis Is set of TF targets over-represented among genes in cluster?

Total genes Genes with binding site (N) (K)

Genes in cluster (n)

Genes with binding site (k) Adapted from Can (John) Bruce HypergeometricHypergeometric Distribution: Distribution: ProbabilityProbability ofof kk TFTF targetstargets ifif clustercluster isis randomrandom samplesample Over-Representation Analysis If 17 genes in cluster, 5 with transcription factor binding site…

Total genes Genes with binding site (1000) (100)

Genes in cluster (17) 100 1000 100    5175 P(5 |17,100,1000)  0.017 1000  17

17  Px( |17,100,1000) 0.02 Genes with binding site x5 (5) Adapted from Can (John) Bruce

MustMust choosechoose thresholdthreshold toto definedefine “differential“differential expression”expression” Identifying regulators of TLR responses Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

K-means clustering defined 11 groups of genes comprising regulated ‘waves’ of transcription

WhatWhat isis thethe rolerole ofof ATF3?ATF3? Network Analysis: role of ATF3? “Guilt by association” Highly connected are likely to be functionally related

–protein interaction network

ATF3ATF3 (red) (red) interacts interacts with with AP1 AP1 (light (light blue) blue) and and NF- NF- B B (light (light green) green) TF TF complexes complexes What is the role of ATF3?

Identified many target genes with nearby ATF3 and NFkB binding sites

Temporal recruitment of ATF3 and Rel to Il6 and Il12b promoters

ChIP assays

HowHow does does ATF3 ATF3 regulate regulate IL6 IL6 and and IL12b? IL12b? What is the role of ATF3? Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

IL6 mRNA (Atf3-/-) Model used to predict IL6 mRNA as ATF3 IL6 mRNA function of Rel and ATF3 binding mRNA degradation

IL6 mRNA (predicted)

Rel Influence on transcription

Predict

Change in IL6 mRNA

ATF3ATF3 isis aa negativenegative regulatorregulator ofof IL6IL6 andand IL12bIL12b Which TFs are driving dynamics of each cluster? Temporal activation of macrophages by TLR4 agonist bacterial lipopolysaccharide (LPS)

NeedNeed toto assignassign genesgenes toto singlesingle clustercluster Can we identify TFs driving B cell differentiation? Implicate TFs by analyzing behavior of target genes

Experiment Naive B (B cell subset) If genes targeted by particular transcription factor are differentially expressed, then the transcription factor is likely to play role GC B Gene

Memory B

NeedNeed toto identifyidentify whichwhich genesgenes areare differentiall-expresseddifferentiall-expressed Gene Set Enrichment Analysis (GSEA) Are TF targets enriched among most differentially expressed genes?

Rank genes by expression Enrichment Score Running SumStatistic

Transcription factor targets (Subramanian et al, PNAS, 2005)

DoesDoes notnot requirerequire aa thresholdthreshold forfor differentialdifferential expressionexpression Gene Set Enrichment Analysis (GSEA) What is distribution for enrichment score (ES) under null hypothesis?

Distribution of ES values for “random” data

Random Calculate P value is fraction of “random” permutations of data ES data with higher ES

PermutePermute classclass labelslabels oror genesgenes toto estimateestimate nullnull distributiondistribution Can we identify TFs driving mutation targeting? Are particular motifs enriched among the most mutated genes?

If genes targeted by particular transcription factor tend to be more mutated, then the transcription factor is likely to play role

TargetTarget genesgenes identifiedidentified byby presencepresence ofof bindingbinding sitessites Does E2a influence AID targeting? Are transcription factor target genes enriched among the most mutated?

Computational screen including E2a + Gene Set Enrichment Analysis all TRANSFAC transcription factors (Subramanian et al, PNAS, 2005)

dKO Mutation Frequency E2a binding sites top (and only) significant hits

Genes with binding sites (+/- 2Kb) Found through computational screen

YesYes, ,E2a E2a sites sites enriched enriched among among mutated mutated genes genes in in UNG/MSH2 UNG/MSH2 dKO dKO mice mice Other Applications of Gene Set Enrichment Analysis

Molecular Signatures Database at Broad Institute

GeneGene setssets cancan alsoalso bebe defineddefined manuallymanually Structured, controlled vocabularies (ontologies) that describe gene products in terms of associated biological processes, cellular components and molecular functions

Organization and functional annotation of molecular aspects of cellular system

AnnotationsAnnotations include include evidenceevidence code code (experimental(experimental and and computational)computational)

(Lovering et al, Immunology, 2008) For more information:

OrOr sendsend meme emailemail at:at: [email protected]@yale.edu