Tiling Array and ChIP-chip
Gene Regulation
[Figure: expression vs. no expression of genes X, Y, and Z, regulated spatially and temporally; panels labeled A, B, C.]
Transcription Factors and Their Binding Sites
Transcription factors (TF): TF1 TF2
Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT
[Figure: TF1 and TF2 bound to their sites on the same sequence.]
TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...
Transcription factor binding motif
Sequences bound by the TF:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

Aligned binding sites (positions 123456789):
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC

Count matrix:
     1  2  3  4  5  6  7  8  9
A    0  0  1  0  1  0  0  0  1
C    0  0  0  0  0  0  0  0  4
G    0  6  5  6  0  6  6  0  1
T    6  0  0  0  5  0  0  6  0

Transcription Factor Binding Sites (TFBS)
Frequency matrix:
     1     2     3     4     5     6     7     8     9
A    0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G    0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T    1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
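The count and frequency matrices above can be reproduced from the six aligned sites. The following is a minimal sketch; the function names `count_matrix` and `frequency_matrix` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The six aligned 9-bp binding sites listed on the motif slide.
sites = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]

def count_matrix(aligned_sites):
    """Return {base: [count at each position]} for equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(s) != length for s in aligned_sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(column) for column in zip(*aligned_sites)]
    return {base: [col[base] for col in columns] for base in "ACGT"}

def frequency_matrix(counts):
    """Divide every count by the number of sites (the column total)."""
    n_sites = sum(counts[base][0] for base in "ACGT")
    return {base: [c / n_sites for c in counts[base]] for base in "ACGT"}

counts = count_matrix(sites)
freqs = frequency_matrix(counts)
for base in "ACGT":
    print(base, " ".join(f"{f:.2f}" for f in freqs[base]))
```

Rounded to two decimals, 4/6 prints as 0.67, whereas the slide lists 0.66 for C at position 9; the slide value appears to be truncated rather than rounded.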
[Figure label: Motif]
Finding motifs from co-regulated genes
(Roth et al., 1998; Hughes et al., 2000; etc.)
Gene1  GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
Gene2  CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
Gene3  TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
[Figure: expression of Gene 1, Gene 2, Gene 3, …, Gene N under Condition1 and Condition2.]
Motif discovery is difficult in mammalian genomes due to a low signal-to-noise ratio
[Figure: region to search per gene. yeast: 100~1000 bp for each of Gene1, Gene2, Gene3; human: 10k~1000k bp for each of Gene1, Gene2, Gene3.]
ChIP-chip
Genome Tiling Arrays
• Affymetrix genome tiling microarrays
  – Tile the genome non-repeat regions
  – Chr21/22 tiling (earlier version): 1 million probe pairs (PM & MM) at 35 bp resolution on 3 arrays
  – Whole genome: 42 million PM probes on 7 arrays
PM  CGACATTGATTCAAGACTACATACA
MM  CGACATTGATTCTAGACTACATACA
[Figure: probes tiled along the chromosome.]
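The PM and MM probes above differ only at the middle (13th) base, which is replaced by its complement. A minimal sketch of that relationship follows; `mismatch_probe` is a hypothetical helper name, not part of any vendor software.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatch_probe(pm):
    """Build the MM probe by complementing the middle base of the PM probe."""
    mid = len(pm) // 2  # index 12, i.e. the 13th base of a 25mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "CGACATTGATTCAAGACTACATACA"
mm = mismatch_probe(pm)
print(pm)
print(mm)
```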
By Xiaole Shirley Liu at Harvard
Genome Tiling Arrays
Platform     # Arrays  # Probes / Array  # Total Probes  Probe Length  Probe Resolution                              Price (human genome)
Affymetrix   7         6M                42.0M           25mer         35 bp                                         $2,000
NimbleGen    38        390K              14.8M           50mer         110 bp                                        $30,000
Agilent      21        244K              5.1M            60mer         300 bp in genes; 500 bp in intergenic         $11,000
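As a quick consistency check on the table, the total probe count for each platform is the number of arrays times the probes per array. The sketch below only redoes that arithmetic with the figures from the table.

```python
# (number of arrays, probes per array) as listed in the table.
platforms = {
    "Affymetrix": (7, 6_000_000),
    "NimbleGen": (38, 390_000),
    "Agilent": (21, 244_000),
}

# Total probes = arrays x probes per array.
totals = {name: arrays * per_array for name, (arrays, per_array) in platforms.items()}

for name, total in totals.items():
    print(f"{name}: {total / 1e6:.1f}M total probes")
```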
ChIP-chip Array Hybridization
• Map high intensity probes back to the genome
• Locate TF binding location
[Figure: probe intensities along the chromosome, showing ChIP-DNA signal and noise.]
Identify ChIP-enriched Region
• Controls: sonicated genomic Input DNA
• Often 3 ChIP, 3 Ctrl replicates are needed
[Figure: ChIP and Ctrl tracks.]
Other Applications
• Transcription factor binding (ChIP-chip)
• Chromatin modifications
• DNA methylation
• Nucleosome positioning
• Copy number variations
Back to ChIP-chip Data Analysis
Preprocessing & Normalization
Peak Detection
Downstream Analyses
[Figure: raw data, ChIP and Control.]
Mann-Whitney U-test for ChIP-region Detection
• Affy TAS, Cawley et al. (Cell 2004):
  – Each probe: rank probes (either PM-MM or PM) within [-500bp, +500bp] window
  – Check whether the sum of ChIP ranks is much smaller
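The window test can be sketched as a rank-sum calculation. This is an illustrative sketch, not the TAS implementation: it assumes one ChIP and one control value per probe, ranks from highest to lowest intensity (so that an enriched ChIP sample has a small rank sum, matching the wording above), and uses the normal approximation without a tie correction. The function names and the toy data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def window_rank_sum_test(positions, chip, ctrl, center, half_window=500):
    """Return (ChIP rank sum, one-sided p-value) for probes within the window."""
    idx = [i for i, p in enumerate(positions) if abs(p - center) <= half_window]
    chip_vals = [chip[i] for i in idx]
    ctrl_vals = [ctrl[i] for i in idx]
    n1, n2 = len(chip_vals), len(ctrl_vals)
    ranks = average_ranks(chip_vals + ctrl_vals)
    w = sum(ranks[:n1])  # rank sum of the ChIP probes
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z), small when w is small
    return w, p_value

# Toy data: probes every 35 bp, with a ChIP bump between 1500 and 1900 bp.
positions = list(range(0, 3500, 35))
ctrl = [1.0 + 0.01 * (i % 5) for i in range(len(positions))]
chip = [c + 0.005 + (2.0 if 1500 <= p <= 1900 else 0.0)
        for c, p in zip(ctrl, positions)]

w_peak, p_peak = window_rank_sum_test(positions, chip, ctrl, center=1700)
w_bg, p_bg = window_rank_sum_test(positions, chip, ctrl, center=3000)
print(f"peak window: W={w_peak:.1f}, p={p_peak:.4g}")
print(f"background window: W={w_bg:.1f}, p={p_bg:.4g}")
```

With replicates, one option is to pool all ChIP values and all control values inside the window before ranking.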
TileMap (Ji and Wong, Bioinformatics 2005)
STEP 1: Compute a test statistic for each probe to summarize probe level information
STEP 2: Combine probe level test statistics of neighboring probes to help infer binding regions
Probe level test statistic: empirical Bayes approach
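Probe: 1, 2, 3, …, I
Sample variance (df): s_1^2, s_2^2, s_3^2, \ldots, s_I^2
Mean: \bar{s^2} = \frac{1}{I} \sum_i s_i^2
Sum of squares: S = \sum_i \left[ s_i^2 - \bar{s^2} \right]^2
Shrinkage factor: \hat{B} = \frac{2}{df+2} \cdot \frac{I-1}{I} + \frac{2}{df+2} \cdot \frac{I-1}{S} \left( \bar{s^2} \right)^2
Variance shrinkage estimator: \hat{\sigma}_i^2 = (1 - \hat{B}) s_i^2 + \hat{B} \bar{s^2}
Variance estimates: \hat{\sigma}_1^2, \hat{\sigma}_2^2, \hat{\sigma}_3^2, \ldots, \hat{\sigma}_I^2
A modified t-statistic: \tilde{t}_i = \frac{\bar{x}_{i1} - \bar{x}_{i2}}{\hat{\sigma}_i \sqrt{\frac{1}{K_1} + \frac{1}{K_2}}}
Probe level test statistics: \tilde{t}_1, \tilde{t}_2, \tilde{t}_3, \ldots, \tilde{t}_I
Combining neighboring probes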
TileMap (MA)
1. Compute the probe level test statistic t for each probe;
2. Compute a moving average statistic to measure enrichment;
3. Estimate FDR.
TileMap (HMM)
1. Compute the probe level test statistic t for each probe;
2. Estimate the distribution of t under H0 and H1;
3. Model t by a Hidden Markov Model, and decode the HMM.
Shrinking variance increases statistical power
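To make the shrinkage idea concrete, steps 1 and 2 of TileMap (MA) can be sketched as below. This is a sketch under stated assumptions, not the TileMap code: the per-probe sample variance is taken to be the pooled two-sample variance with df = K1 + K2 - 2, K1 and K2 are taken to be the replicate counts of the two conditions, the shrinkage factor is capped at 1, and the FDR step is omitted. The function names and toy data are hypothetical.

```python
import math
from statistics import mean, variance

def shrunken_t(group1, group2):
    """group1[i] and group2[i] hold the replicate values of probe i in each condition.

    Returns (list of modified t-statistics, shrinkage factor B).
    """
    n_probes = len(group1)
    k1, k2 = len(group1[0]), len(group2[0])
    df = k1 + k2 - 2
    # Assumption: pooled sample variance per probe.
    s2 = [((k1 - 1) * variance(a) + (k2 - 1) * variance(b)) / df
          for a, b in zip(group1, group2)]
    s2_bar = mean(s2)
    ss = sum((v - s2_bar) ** 2 for v in s2)
    if ss == 0:
        b_hat = 1.0  # all variances equal: shrink fully to the mean
    else:
        b_hat = (2 / (df + 2)) * (n_probes - 1) / n_probes \
            + (2 / (df + 2)) * (n_probes - 1) / ss * s2_bar ** 2
        b_hat = min(1.0, b_hat)  # assumption: cap the factor at 1
    sigma2 = [(1 - b_hat) * v + b_hat * s2_bar for v in s2]
    scale = math.sqrt(1 / k1 + 1 / k2)
    t = [(mean(a) - mean(b)) / (math.sqrt(sg) * scale)
         for a, b, sg in zip(group1, group2, sigma2)]
    return t, b_hat

def moving_average(values, half_width):
    """Average each value with its neighbors; windows are truncated at the ends."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out

# Toy data: 5 probes, 2 IP replicates vs 2 control replicates.
ip = [[1.1, 0.9], [1.0, 1.4], [3.0, 3.2], [2.9, 3.5], [1.2, 0.8]]
ct = [[1.0, 1.2], [0.9, 1.1], [1.0, 1.1], [1.2, 0.8], [1.1, 1.0]]
t_stats, b = shrunken_t(ip, ct)
print("B =", round(b, 3))
print("moving average:", [round(x, 2) for x in moving_average(t_stats, 1)])
```

Because every probe's variance is pulled toward the array-wide mean, a probe with an accidentally tiny sample variance no longer produces an extreme t-statistic, which is the motivation stated in the slide title.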
[Figure: moving average tracks based on the t-statistic with variance shrinking, the canonical t-statistic, and Mean(X1)-Mean(X2); Peak 2 (180 bp) tested in transgenics, with neural tube expression.]
Comparisons between TileMap and previous methods
cMyc ChIP-chip Data: 6 IP + 6 CT1 + 6 CT2
Gold Standard: Using GTRANS and Keles' method to analyze all 18 arrays
Test data: 4 arrays, 2 IP vs 2 CT1 (s2r2)
TileMap-HMM (Ji & Wong, 2005)
GTRANS or TAS (Kampa et al., 2004)
1. Set a window;
2. Perform a Wilcoxon signed rank test for each window.
Keles et al. (2004)
1. Compute a t-statistic t for each probe (no shrinking, two sample only);
2. Rank probes by a moving average.
Shrinking variance saves money
• Using non-shrinking method (Keles' method) to analyze all probes
• Using shrinking method to analyze half of the probes, i.e., reduce information by half
MAT (Johnson W.E. et al. PNAS, 2006)
• Model-based Analysis of Tiling arrays for ChIP-chip
• Goal:
  – Find ChIP-regions without replicates
  – Find ChIP-regions without controls
  – Find ChIP-regions without MM probes
  – Can analyze data array by array
MAT
• Estimate probe behavior by checking other probes with similar sequence on the same array
• Probe sequence plays a big role in signal value
• Most of the probes in ChIP-chip measure non-specific hybridization
Probe Behavior Model
Model terms for each 25mer probe:
– Baseline on number of Ts
– A, C, G at each position of the 25mer
– A, C, G, T count squared
– Copy number of the 25mer along the genome
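One way to picture these terms is as the covariates of a linear model for log(PM). The sketch below only builds the covariate vector for a single probe; fitting the coefficients (for example by least squares over all probes on an array) is not shown. The function name `probe_features`, the ordering of the covariates, and the use of the logarithm of the copy number are assumptions made here for illustration.

```python
import math

BASES = "ACGT"

def probe_features(seq, copy_number):
    """Covariates for one 25mer probe.

    Layout (an assumption of this sketch):
      [0]      number of Ts (baseline term)
      [1:76]   indicators for A, C, G at each of the 25 positions
      [76:80]  squared counts of A, C, G, T
      [80]     log of the probe's copy number in the genome
    """
    if len(seq) != 25:
        raise ValueError("expected a 25mer")
    n_t = float(seq.count("T"))
    indicators = [1.0 if seq[j] == base else 0.0
                  for j in range(25) for base in "ACG"]
    squared_counts = [float(seq.count(base) ** 2) for base in BASES]
    return [n_t] + indicators + squared_counts + [math.log(copy_number)]

features = probe_features("CGACATTGATTCAAGACTACATACA", copy_number=1)
print(len(features), "covariates; T count =", features[0])
```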
Probe Standardization
• Fit the probe model array by array
• Divide array probes to bins (3k probes/bin)
• Background-subtraction and standardization (normalization) on a single array:

t_i = \frac{\log(PM_i) - \hat{m}_i}{s_{i,\text{affinity bin}}}

  – \log(PM_i): observed probe intensity
  – \hat{m}_i: model predicted probe intensity
  – s_{i,\text{affinity bin}}: observed probe variance within each bin
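The standardization step can be sketched as follows, assuming probes are binned by their model-predicted intensity and that s is the sample standard deviation of log(PM) within each bin (the slide labels it the observed variance within each bin). The function name `standardize` and the guard for degenerate bins are hypothetical choices for this sketch.

```python
from statistics import stdev

def standardize(log_pm, predicted, bin_size=3000):
    """Return t_i = (log_pm_i - predicted_i) / s_bin for every probe.

    Probes are sorted by predicted intensity and cut into consecutive
    bins of `bin_size` probes (3k probes per bin on the slide).
    """
    order = sorted(range(len(log_pm)), key=lambda i: predicted[i])
    t = [0.0] * len(log_pm)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        # Assumption: spread of the observed log(PM) values in the bin.
        s = stdev([log_pm[i] for i in members]) if len(members) > 1 else 1.0
        if s == 0:
            s = 1.0  # guard against a degenerate bin
        for i in members:
            t[i] = (log_pm[i] - predicted[i]) / s
    return t

# Tiny example with 2 probes per bin.
log_pm = [1.0, 3.0, 10.0, 14.0]
predicted = [2.0, 2.0, 12.0, 12.0]
t_values = standardize(log_pm, predicted, bin_size=2)
print([round(v, 3) for v in t_values])
```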
Eliminate Normalization
• Probe log(PM) values before and after standardization
• If the data are normalized before model fitting:
  – The same ChIP-regions are predicted, although with less confidence
ChIP-region Detection
• Window-based MATscore
  – ChIP without Ctrl:
    MAT(region) = \sqrt{n_{ChIP}} \cdot TM(t\text{'s in region})
  – TM: trimmed mean
  – Multiple ChIP with multiple Ctrl:
    MAT(region) = \sqrt{n_{ChIP}} \cdot \frac{TM(t\text{'s in ChIP}) - TM(t\text{'s in Input})}{\sigma_{Input}}
  – More probes, higher t values in ChIP, less variance (fluctuation) → more confident
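A minimal sketch of the window score follows. The 10% trimming fraction on each side is an assumption, as is the square root applied to the number of ChIP probes in the window; `trimmed_mean` and `mat_score` are hypothetical names, not functions from the MAT package.

```python
import math

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def mat_score(chip_t, input_t=None, sigma_input=1.0, trim=0.1):
    """Score a window from the standardized t-values of its probes.

    chip_t:      t-values of all ChIP probes in the window
    input_t:     t-values of Input probes in the window, or None if no control
    sigma_input: spread of the Input signal (used only with a control)
    """
    n_chip = len(chip_t)
    if input_t is None:
        return math.sqrt(n_chip) * trimmed_mean(chip_t, trim)
    diff = trimmed_mean(chip_t, trim) - trimmed_mean(input_t, trim)
    return math.sqrt(n_chip) * diff / sigma_input

print(mat_score([2.1, 1.9, 2.4, 2.0, 8.0, 1.8, 2.2, 2.3, 1.7, 2.0]))
print(mat_score([3.0, 3.2, 2.8, 3.1], [1.0, 0.9, 1.1, 1.0], sigma_input=0.5))
```

The trimmed mean keeps a single outlier probe (8.0 in the first call) from dominating the window score.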
Raw probe values at two spike-in regions with concentration 2X
[Figure panels: ChIP_1 log(PM); Input_1 log(PM).]
Sequence-based probe behavior standardization
[Figure panels: ChIP_1 t-value; Input_1 t-value.]
Window-based neighboring probe combination for ChIP-region detection
[Figure panels: ChIP_1 MATscore; ChIP_1/Input_1 MATscore; 3 Reps ChIP/Input MATscore.]
Statistical Significance of Hits
[Figure: MATscore distribution, with background and <1% enriched DNA.]
• P-value and FDR cutoff:
  – P-value from MATscore distribution
  – Estimate negative peaks under the same P-value cutoff
  – Regional FDR = #negative_peaks / #positive_peaks
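The regional FDR is a simple ratio of peak counts. In the sketch below, "the same P-value cutoff" is approximated by a symmetric score cutoff (peaks at or above +cutoff are positive, at or below -cutoff are negative); that symmetry, the function name, and the toy scores are assumptions of this sketch.

```python
def regional_fdr(peak_scores, cutoff):
    """Estimate FDR as (#negative peaks) / (#positive peaks) at a score cutoff.

    Returns None when no positive peak passes the cutoff.
    """
    positive = sum(1 for s in peak_scores if s >= cutoff)
    negative = sum(1 for s in peak_scores if s <= -cutoff)
    if positive == 0:
        return None
    return min(1.0, negative / positive)

scores = [5.0, 6.0, 7.0, 8.0, -5.5, 1.0, -1.0, 0.2]
print(regional_fdr(scores, cutoff=5.0))
```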
MAT summary
• Open source python http://chip.dfci.harvard.edu/~wli/MAT/
• Runs faster than array scanner
• Can work with single ChIP, multiple ChIP, and multiple ChIP with controls with increasing accuracy
  – Use single ChIP on promoter arrays to test antibody and protocol before going whole genome
• Can identify individual failed samples
Benchmark for ChIP-chip Target Detection (Johnson D.S. et al. Genome Research, 2008)
• ENCODE Spike-in experiment: both amplified and un-amplified
  – ChIP: 96 ENCODE clones, 2, 4, 8, ..., 256X enrichment + total genomic DNA
  – Input: total chromatin DNA
• Blind test: Samples hybridized to different tiling arrays, predictions made before the key was released
Comparison of platforms
Comparison of algorithms
[Figure: combined Johnson D.S. et al. Genome Research 2008 with Ji H. et al. Nature Biotechnology 2008.]
Residual Probe Effects after MAT
TileProbe (Judy & Ji, Bioinformatics, 2009)
TileProbe vs. MAT (GLI3)
[Figure panels: 1IP 0CT; 3IP 0CT.]
TileProbe vs. MAT (Oct4)
[Figure panels: 1IP 0CT; 3IP 0CT.]
TileProbe vs. MAT (NRSF)
[Figure panels: 1IP 0CT; 2IP 0CT; motif enrichment.]
MBR: Microarray Blob Remover
xMAN: eXtreme MApping of oligoNucleotides
• http://chip.dfci.harvard.edu/~wli/xMAN
• xMAN maps ~42 M Affymetrix tiling probes to the newest human genome assembly in less than 6 CPU hours
  – BLAST needs 20 CPU years; BLAT needs 55 CPU days
  – Probe TCCCAGCACTTTGGGAGGCTGAGGC maps to the genome 50,660 times
• Can map long oligos, and paired tag high throughput sequencing fragments
• Stores the copy number information of every probe
• xMAN filters tiling array probes to ensure one unique probe measurement per 1 kb, which improves peak detection
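The two bookkeeping ideas above, probe copy number and one measurement per probe per 1 kb, can be illustrated with a naive sketch. This is not the xMAN algorithm: it counts exact forward-strand matches with a dictionary, which would not scale to a whole genome, and the 1 kb rule is interpreted here as dropping repeated hits of the same probe that fall within 1 kb of a kept hit. All names and data are hypothetical.

```python
from collections import Counter

def copy_numbers(genome, probes):
    """Count exact forward-strand occurrences of each equal-length probe."""
    k = len(probes[0])
    kmer_counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {probe: kmer_counts[probe] for probe in probes}

def one_hit_per_window(hits, window=1000):
    """Keep a (position, probe) hit only if that probe has no kept hit
    within the preceding `window` bases."""
    last_kept = {}
    kept = []
    for position, probe in sorted(hits):
        if probe in last_kept and position - last_kept[probe] < window:
            continue
        last_kept[probe] = position
        kept.append((position, probe))
    return kept

counts = copy_numbers("ACGTACGTAC", ["ACGT", "CGTA", "TTTT"])
print(counts)

hits = [(0, "p1"), (300, "p1"), (1500, "p1"), (200, "p2")]
filtered = one_hit_per_window(hits)
print(filtered)
```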
CisGenome (Ji H. et al. Nature Biotechnol., 2008)
[Figure: CisGenome components: Graphic User Interface; CisGenome Browser; Core Data Analysis Programs.]
CEAS: Cis-regulatory Element Annotation System
• Data Analysis Button for Biologists
http://ceas.cbi.pku.edu.cn