Tiling Array and ChIP-chip

Gene Regulation

[Figure: genes X, Y, Z shown with expression or no expression, regulated spatially and temporally; remaining panel labels: A, B, C]

Transcription Factors and Their Binding Sites

Transcription factors (TF): TF1 TF2

Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT

[Figure: TF1 and TF2 bound at their sites on the sequence below]

TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...

Transcription factor binding motif

TF-bound sequences and the aligned binding site in each:

GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA   TGGGTGGTC
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA   TGGGTGGTA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA   TGGGAGGTC
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG   TGGGTGGTG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC   TGAGTGGTC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG   TGGGTGGTC

Count matrix (positions 1 to 9):

     1  2  3  4  5  6  7  8  9
A    0  0  1  0  1  0  0  0  1
C    0  0  0  0  0  0  0  0  4
G    0  6  5  6  0  6  6  0  1
T    6  0  0  0  5  0  0  6  0

Transcription Factor Binding Sites (TFBS)

Frequency matrix (positions 1 to 9):

     1     2     3     4     5     6     7     8     9
A    0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G    0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T    1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
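The count and frequency matrices can be derived directly from the six aligned sites. The following is a minimal sketch; the function names `count_matrix` and `frequency_matrix` are illustrative and not from the slides.

```python
# Aligned binding sites taken from the six sequences on the slide.
SITES = ["TGGGTGGTC", "TGGGTGGTA", "TGGGAGGTC",
         "TGGGTGGTG", "TGAGTGGTC", "TGGGTGGTC"]

def count_matrix(sites):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def frequency_matrix(counts):
    """Convert counts to per-position frequencies (each column sums to 1)."""
    n_sites = sum(counts[base][0] for base in "ACGT")
    return {base: [c / n_sites for c in row] for base, row in counts.items()}

counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
```

Position 9 is the least conserved column (C in 4 of 6 sites), which is why its frequencies are spread over A, C, and G.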

Finding motifs from co-regulated genes

(Roth et al., 1998; Hughes et al., 2000; etc.)

Gene1  GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
Gene2  CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
Gene3  TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

[Figure: expression of Gene 1, Gene 2, Gene 3, ..., Gene N under Condition1 and Condition2]

Motif discovery is difficult in mammals due to a low signal-to-noise ratio

[Figure: sequence length to search per gene]
yeast: 100~1000 bp each for Gene1, Gene2, Gene3
human: 10k~1000k bp each for Gene1, Gene2, Gene3

ChIP-chip Tiling Arrays

• Affymetrix genome tiling microarrays
  – Tile the non-repeat regions of the genome
  – Chr21/22 tiling (earlier version): 1 million probe pairs (PM & MM) at 35 bp resolution on 3 arrays
  – Whole genome: 42 million PM probes on 7 arrays

PM CGACATTGATTCAAGACTACATACA
MM CGACATTGATTCTAGACTACATACA
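The PM and MM probes above differ only at the middle (13th) base, which in the MM probe is replaced by its complement. A minimal sketch of that relationship follows; the helper name `mismatch_probe` is illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatch_probe(pm):
    """Return the MM probe: the PM probe with its middle base complemented."""
    mid = len(pm) // 2  # index 12 (the 13th base) for a 25mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

mm = mismatch_probe("CGACATTGATTCAAGACTACATACA")
```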

[Figure: probes tiled along the chromosome]

By Xiaole Shirley Liu at Harvard

Genome Tiling Arrays

Platform     # Arrays   # Probes / Array   # Total Probes   Probe Length   Probe Resolution                              Price (human genome)
Affymetrix   7          6M                 42.0M            25mer          35 bp                                         $2,000
NimbleGen    38         390K               14.8M            50mer          110 bp                                        $30,000
Agilent      21         244K               5.1M             60mer          300 bp in genes; 500 bp in intergenic         $11,000

ChIP-chip Array Hybridization

• Map high-intensity probes back to the genome
• Locate TF binding location

[Figure: probe intensities along the chromosome, showing ChIP-DNA signal and noise]

Identify ChIP-enriched Region

• Controls: sonicated genomic Input DNA
• Often 3 ChIP and 3 Ctrl replicates are needed

[Figure: ChIP and Ctrl probe signals]

Other Applications

• Transcription factor binding (ChIP-chip)

modifications

• DNA

• Nucleosome positioning

• Copy number variations

Back to ChIP-chip Data Analysis

Preprocessing & Normalization

Peak Detection

Downstream Analyses

Raw data

[Figure: raw probe data for ChIP and Control]

Mann-Whitney U-test for ChIP-region Detection

• Affy TAS, Cawley et al. (Cell 2004):
  – Each probe: rank probes (either PM-MM or PM) within [-500bp, +500bp] window
  – Check whether sum of ChIP ranks is much smaller
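The window-based rank test can be sketched as follows. This is a minimal illustration, not the TAS implementation: it assumes ranks are assigned in descending order (rank 1 = highest intensity) so that enriched ChIP probes get a small rank sum, and it uses a normal approximation without tie correction. The function names and data layout are hypothetical.

```python
import math

def rank_sum_pvalue(chip, ctrl):
    """One-sided Mann-Whitney test: is the ChIP rank sum unusually small?"""
    pooled = sorted([(v, 0) for v in chip] + [(v, 1) for v in ctrl],
                    key=lambda vg: vg[0], reverse=True)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):            # give tied values their average rank
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(chip), len(ctrl)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)

def window_pvalues(positions, chip, ctrl, half_width=500):
    """chip[k] and ctrl[k] hold the replicate values of the probe at positions[k]."""
    pvalues = []
    for center in positions:
        idx = [k for k, p in enumerate(positions) if abs(p - center) <= half_width]
        pvalues.append(rank_sum_pvalue([v for k in idx for v in chip[k]],
                                       [v for k in idx for v in ctrl[k]]))
    return pvalues
```

Pooling all replicates of all probes in the window gives the test enough observations to work even with few arrays, at the cost of treating neighboring probes as exchangeable.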

TileMap (Ji and Wong, Bioinformatics 2005)

STEP 1: Compute a test statistic for each probe to summarize probe level information

STEP 2: Combine probe level test statistics of neighboring probes to help infer binding regions

Probe level test statistic: empirical Bayes approach

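Probe: 1, 2, 3, …, I

Sample variance (df): $s_1^2, s_2^2, s_3^2, \ldots, s_I^2$

Mean: $\overline{s^2}$, the average of the $s_i^2$ over all probes

Sum of squares: $S = \sum_i \left[ s_i^2 - \overline{s^2} \right]^2$

Shrinkage factor: $\hat{B} = \frac{2}{df+2} \cdot \frac{I-1}{I} + \frac{2}{df+2} \cdot \frac{I-1}{S} \left( \overline{s^2} \right)^2$

Variance shrinkage estimator: $\hat{\sigma}_i^2 = (1 - \hat{B})\, s_i^2 + \hat{B}\, \overline{s^2}$

Variance estimates: $\hat{\sigma}_1^2, \hat{\sigma}_2^2, \hat{\sigma}_3^2, \ldots, \hat{\sigma}_I^2$

A modified t-statistic: $\tilde{t}_i = \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{\sqrt{\frac{1}{K_1} + \frac{1}{K_2}}\; \hat{\sigma}_i}$

Probe level test statistics: $\tilde{t}_1, \tilde{t}_2, \tilde{t}_3, \ldots, \tilde{t}_I$

Combining neighboring probes

TileMap (MA)
1. Compute the probe level test statistic t for each probe;
2. Compute a moving average statistic to measure enrichment;
3. Estimate FDR.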
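The variance shrinkage, modified t-statistic, and moving average steps can be sketched as below. This is an illustration under stated assumptions rather than TileMap's code: the per-probe sample variance is taken to be the pooled two-sample variance with df = K1 + K2 - 2, the shrinkage factor is capped at 1, and the window is a fixed number of probes on each side. All function names are hypothetical, and FDR estimation (step 3) is not shown.

```python
import math
from statistics import mean, variance

def shrunken_variances(s2, df):
    """Shrink per-probe variances toward their mean across probes."""
    n_probes = len(s2)
    s2_bar = mean(s2)
    ss = sum((v - s2_bar) ** 2 for v in s2)
    if ss == 0:
        return list(s2)               # all variances equal: nothing to shrink
    b = (2 / (df + 2)) * (n_probes - 1) / n_probes \
        + (2 / (df + 2)) * (n_probes - 1) / ss * s2_bar ** 2
    b = min(1.0, b)                   # assumption: cap the shrinkage factor at 1
    return [(1 - b) * v + b * s2_bar for v in s2]

def modified_t(ip, ct):
    """ip[i], ct[i]: replicate values of probe i in the two conditions."""
    k1, k2 = len(ip[0]), len(ct[0])
    df = k1 + k2 - 2
    s2 = [((k1 - 1) * variance(a) + (k2 - 1) * variance(b)) / df
          for a, b in zip(ip, ct)]    # assumption: pooled sample variance
    sigma2 = shrunken_variances(s2, df)
    scale = math.sqrt(1 / k1 + 1 / k2)
    return [(mean(a) - mean(b)) / (scale * math.sqrt(v))
            for a, b, v in zip(ip, ct, sigma2)]

def moving_average(t, half_window=2):
    """Average each probe's statistic with its neighbors along the genome."""
    out = []
    for i in range(len(t)):
        lo, hi = max(0, i - half_window), min(len(t), i + half_window + 1)
        out.append(sum(t[lo:hi]) / (hi - lo))
    return out
```

With few replicates the raw $s_i^2$ are noisy, so borrowing strength across probes stabilizes the denominator of the t-statistic.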

TileMap (HMM)
1. Compute the probe level test statistic t for each probe;
2. Estimate the distribution of t under H0 and H1;
3. Model t by a Hidden Markov Model, and decode the HMM.

Shrinking variance increases statistical power

[Figure tracks: Moving Average; t-statistic, variance shrinking; t-statistic, canonical; Mean(X1)-Mean(X2)]

Peak 2 (180bp) transgenics

[Figure: neural tube expression]

Transgenics

Comparisons between TileMap and previous methods

cMyc ChIP-chip Data: 6 IP + 6 CT1 + 6 CT2
Gold Standard: Using GTRANS and Keles’ method to analyze all 18 arrays
Test data: 4 arrays, 2 IP vs 2 CT1 (s2r2)

TileMap-HMM (Ji & Wong, 2005)

GTRANS or TAS (Kampa et al., 2004)
1. Set a window;
2. Perform a Wilcoxon signed rank test for each window.

Keles et al. (2004)
1. Compute a t-statistic t for each probe (no shrinking, two sample only);
2. Rank probes by a moving average.

Shrinking variance saves money

Using non-shrinking method (Keles’ method) to analyze all probes

Using shrinking method to analyze half of the probes, i.e., reduce information by half

MAT (Johnson W.E. et al. PNAS, 2006)

• Model-based Analysis of Tiling arrays for ChIP-chip

• Goal:
  – Find ChIP-regions without replicates
  – Find ChIP-regions without controls
  – Find ChIP-regions without MM probes
  – Can analyze data array by array

MAT

• Estimate probe behavior by checking other probes with similar sequence on the same array
• Probe sequence plays a big role in signal value
• Most of the probes in ChIP-chip measure non-specific hybridization

Probe Behavior Model

[Equation figure; annotated model terms:]
• Baseline on number of Ts
• A, C, G at each position
• A, C, G, T count square
• 25mer copy number (copy number of the 25mer along the genome)
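To make the model terms concrete, the sketch below builds one feature vector per probe from the four annotated terms. The feature layout, the function name `probe_features`, and the use of the logarithm of the copy number are assumptions for illustration. In a regression on log(PM), each feature would receive one fitted coefficient.

```python
import math

def probe_features(seq, copy_number):
    """Feature vector for one probe (81 values for a 25mer; layout is hypothetical)."""
    features = [seq.count("T")]                        # baseline on number of Ts
    for base in seq:                                   # A, C, G indicator at each position
        features.extend(1 if base == b else 0 for b in "ACG")
    features.extend(seq.count(b) ** 2 for b in "ACGT") # squared A, C, G, T counts
    features.append(math.log(copy_number))             # copy number of the 25mer in the genome
    return features

f = probe_features("CGACATTGATTCAAGACTACATACA", 1)
```

T is left out of the per-position indicators because, with the T-count baseline in the model, a T at a position is implied when A, C, and G are all absent.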

Probe Standardization

• Fit the probe model array by array
• Divide array probes into bins (3k probes/bin)
• Background subtraction and standardization (normalization) on a single array:

$t_i = \dfrac{\log(PM_i) - \hat{m}_i}{s_{i,\ \text{affinity bin}}}$

where $\log(PM_i)$ is the observed probe intensity, $\hat{m}_i$ is the model predicted probe intensity, and $s_{i,\ \text{affinity bin}}$ is computed from the observed probe variance within the affinity bin of probe i.
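A minimal sketch of the standardization step follows. It assumes probes are binned by their model predicted intensity and that the scale of each bin is the standard deviation of the observed log(PM) values in that bin. The function name and arguments are hypothetical.

```python
from statistics import pstdev

def standardize(log_pm, predicted, bin_size=3000):
    """Return t-values: (observed - predicted) / spread of the probe's affinity bin."""
    order = sorted(range(len(log_pm)), key=lambda i: predicted[i])
    t = [0.0] * len(log_pm)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]        # probes with similar predicted affinity
        s = pstdev([log_pm[i] for i in members])
        for i in members:
            t[i] = (log_pm[i] - predicted[i]) / s if s > 0 else 0.0
    return t

t_values = standardize([1, 3, 10, 14], [2, 2, 12, 12], bin_size=2)
```

Because both the baseline and the scale come from the same array, the resulting t-values are comparable across arrays without a separate normalization step.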

Eliminate Normalization

• Probe log(PM) values before and after standardization

• If normalized before model fitting
  – Predicted the same ChIP-regions, although with less confidence

ChIP-region Detection

• Window-based MATscore
  – ChIP without Ctrl:
    $MAT(\text{region}) = \sqrt{n_{ChIP}}\; TM(t\text{'s in region})$
  – TM: trimmed mean
  – Multiple ChIP with multiple Ctrl:
    $MAT(\text{region}) = \sqrt{n_{ChIP}}\; \dfrac{TM(t\text{'s in ChIP}) - TM(t\text{'s in Input})}{\sigma_{Input}}$
  – More probes, higher t values in ChIP, less variance (fluctuation) → more confident
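The window score can be sketched as below. The 10% trim on each side, the use of the standard deviation of the Input t-values in the window as the Input sigma, and the function names are assumptions for illustration, not MAT's actual settings.

```python
import math
from statistics import mean, pstdev

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    v = sorted(values)
    k = int(len(v) * trim)
    return mean(v[k:len(v) - k])

def mat_score(chip_t, input_t=None, trim=0.1):
    """Score one window from the t-values of the probes it contains."""
    n_chip = len(chip_t)
    if input_t is None:                                # ChIP without Ctrl
        return math.sqrt(n_chip) * trimmed_mean(chip_t, trim)
    sigma_input = pstdev(input_t)                      # assumption: spread of Input t-values
    diff = trimmed_mean(chip_t, trim) - trimmed_mean(input_t, trim)
    return math.sqrt(n_chip) * diff / sigma_input
```

The trimmed mean keeps a single outlier probe from dominating a window, and the square-root factor rewards windows supported by more probes.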

Raw probe values at two spike-in regions with concentration 2X

[Figure tracks: ChIP_1 Log(PM); Input_1 Log(PM); both regions labeled 2X]

Sequence-based probe behavior standardization

[Figure tracks: ChIP_1 t-value; Input_1 t-value]

Window-based neighboring probe combination for ChIP-region detection

[Figure tracks: ChIP_1 MATscore; ChIP_1/Input_1 MATscore; 3 Reps ChIP/Input MATscore]

Statistical Significance of Hits

[Figure labels: Background; Enriched DNA (<1% enriched)]

• P-value and FDR cutoff:
  – P-value from MATscore distribution
  – Estimate negative peaks under the same P-value cutoff
  – Regional FDR = #negative_peaks / #positive_peaks
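The regional FDR can be sketched as follows. The sketch assumes region scores are centered at zero, so that the same cutoff applied to the negative side counts the negative peaks. The function name is hypothetical.

```python
def regional_fdr(region_scores, cutoff):
    """Estimate FDR as (# negative peaks) / (# positive peaks) at a symmetric cutoff."""
    positive = sum(1 for s in region_scores if s >= cutoff)
    negative = sum(1 for s in region_scores if s <= -cutoff)
    if positive == 0:
        return None                    # no positive peaks: FDR is undefined
    return negative / positive

fdr = regional_fdr([5, 6, 7, -5.5, 1, -1, 8], cutoff=5)
```

The idea is that true enrichment only produces positive peaks, so negative peaks of the same size estimate how many positive peaks arise from noise alone.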

MAT summary

• Open source Python: http://chip.dfci.harvard.edu/~wli/MAT/
• Runs faster than the array scanner
• Can work with single ChIP, multiple ChIP, and multiple ChIP with controls, with increasing accuracy
  – Use single ChIP on promoter arrays to test antibody and protocol before going whole genome
• Can identify individual failed samples

Benchmark for ChIP-chip Target Detection (Johnson D.S. et al. Genome Research, 2008)

• ENCODE Spike-in experiment: both amplified and un-amplified

[Figure: ChIP and Input samples; labels: 96 ENCODE clones, 2, 4, 8, ..., 256X enrichment; total genomic DNA; total chromatin DNA]

• Blind test: Samples hybridized to different tiling arrays, predictions made before the key was released

Comparison of platforms

Comparison of algorithms

Combined Johnson D.S. et al. Genome Research 2008 with Ji H. et al. Nature Biotechnology 2008

Residual Probe Effects after MAT

TileProbe (Judy & Ji, Bioinformatics, 2009)

TileProbe vs. MAT (GLI3)

[Figure panels: 1IP 0CT; 3IP 0CT]

TileProbe vs. MAT (Oct4)

[Figure panels: 1IP 0CT; 3IP 0CT]

TileProbe vs. MAT (NRSF)

[Figure panels: 1IP 0CT; 2IP 0CT; Motif enrichment]

MBR: Microarray Blob Remover

xMAN: eXtreme MApping of

• http://chip.dfci.harvard.edu/~wli/xMAN
• xMAN maps ~42 M Affymetrix tiling probes to the newest human genome assembly in less than 6 CPU hours
  – BLAST needs 20 CPU years; BLAT needs 55 CPU days
  – Probe TCCCAGCACTTTGGGAGGCTGAGGC maps 50,660 times in the genome
• Can map long oligos and paired-tag high-throughput sequencing fragments
• Stores the copy number information of every probe
• xMAN filters tiling array probes to ensure one unique probe measurement per 1 kb, which improves peak detection

CisGenome (Ji H. et al. Nature Biotechnol., 2008)

Graphic User Interface

CisGenome Browser

Core Data Analysis Programs

CEAS: Cis-regulatory Element Annotation System

• Data Analysis Button for Biologists

http://ceas.cbi.pku.edu.cn