10/30/15

Introduction to genome biology

Lisa Stubbs

We’ve found most genes; but what about the rest of the genome?

Genome size*      12 Mb     95 Mb     170 Mb     1500 Mb    2700 Mb     3200 Mb
# coding genes    ~7000     ~20000    ~14000     ~26000     ~23000      ~21000
# transcripts     ~7000     ~50000    ~29000     ~53000     ~93000      ~200000
bp/gene           1714 bp   4750 bp   12143 bp   57692 bp   117391 bp   152381 bp
*Data taken from the ENSEMBL genome browser, www.ensembl.org

Most notably:
• Coding gene number is relatively constant in metazoans, BUT
• Number of alternative transcripts per gene and gene density are not
  – Each gene gives rise to many more isoforms: sequence diversity
  – Much more non-coding DNA, including gene regulatory DNA
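The density figures in the table are simply genome size divided by coding gene count. A minimal Python sketch that recomputes them, and the transcripts-per-gene ratio, from the rounded values quoted above. The keys `genome_1` to `genome_6` are placeholder labels (an assumption for illustration; the table text does not name the species).

```python
# Rounded values copied from the table: (genome size in Mb, coding genes, transcripts).
# Keys are placeholder labels, not species names.
GENOMES = {
    "genome_1": (12, 7000, 7000),
    "genome_2": (95, 20000, 50000),
    "genome_3": (170, 14000, 29000),
    "genome_4": (1500, 26000, 53000),
    "genome_5": (2700, 23000, 93000),
    "genome_6": (3200, 21000, 200000),
}


def bp_per_gene(size_mb, n_genes):
    """Average genomic span per coding gene, in bp (rounded to the nearest bp)."""
    return round(size_mb * 1_000_000 / n_genes)


def transcripts_per_gene(n_transcripts, n_genes):
    """Average number of annotated transcripts per coding gene."""
    return n_transcripts / n_genes


for name, (size_mb, genes, transcripts) in GENOMES.items():
    print(name, bp_per_gene(size_mb, genes),
          round(transcripts_per_gene(transcripts, genes), 1))
```

Gene number changes roughly threefold across the table, while the span per gene changes by almost two orders of magnitude, which is the point the bullets make.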


Most traditional studies have focused on promoters and nearby (proximal) enhancers

• Promoter regions are most likely to be involved in recruiting RNA polymerase and related factors
  – TATA-binding protein (TBP) and TBP-associated factors (TAFs)
  – General transcription factors (GTFs)
  – Mediator complexes

• Some transcription factors (TFs) are also more likely to be found at promoter sites
  – SP1 and the E2F family are classical examples

• BUT, most other metazoan TFs are found preferentially at distant sites
  – Introns, intergenic regions
  – Some may be 100s or 1000s of bp from the target promoter, or even embedded within neighboring genes

Transcription factors and their binding sites

• Most known TFs have short and variable binding sites, e.g. YY1, SP1, Mzf1

• BUT the probability of finding a string such as the Yy1 “core” (even as a simple string, rather than a matrix) at any given position is (1/4)^4 = 1/256, i.e. about one chance match every 256 bp!
  – Most TFBS are not much more specific than this!

• So, how to raise the probability that the site you find is functional?
  1. Interspecies conservation: sites that are found in similar locations in diverse species are more likely to be functional
  2. Site clustering: most TFs bind as homo- or heterodimers, which significantly stabilizes binding and influences function
  3. Location within regions that are known to be in an “open” state in the cell type and conditions of interest
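The 1/256 figure can be turned into an expected number of chance matches. A minimal Python sketch, assuming independent and equally frequent bases (a simplification; real genomes have skewed base composition). The 4-mer `CCAT` and the toy sequence are used purely as illustrative inputs.

```python
def match_probability(k, base_freq=0.25):
    """Chance that one fixed k-mer matches at a given position,
    assuming independent, equally frequent bases."""
    return base_freq ** k


def expected_hits(k, genome_bp):
    """Expected number of chance matches on one strand of a genome."""
    positions = genome_bp - k + 1  # number of k-long windows
    return positions * match_probability(k)


def count_core(sequence, core):
    """Count occurrences of `core` in `sequence`, allowing overlaps."""
    return sum(
        1
        for i in range(len(sequence) - len(core) + 1)
        if sequence[i:i + len(core)] == core
    )


print(match_probability(4))                   # 1/256 per position
print(round(expected_hits(4, 3_200_000_000)))  # chance matches in a 3200 Mb genome
print(count_core("GGCCATCCATTT", "CCAT"))      # matches in a toy sequence
```

With roughly 12.5 million chance matches expected on one strand of a 3200 Mb genome, a bare string match says very little, which is why conservation, clustering and chromatin state are needed as filters.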


How to find the regulatory needles in the haystack?

• Vertebrate genomes are mostly non-coding
  – ~2% coding; ~5% noncoding and evolutionarily conserved (at the DNA sequence alignment level)
• Websites to view pre-aligned sequence conservation levels abound; e.g. the ECR browser http://ecrbrowser.dcode.org/
• zPicture and Mulan provide “do it yourself” tools for pairwise or multi-sequence alignments of up to 1 Mb; http://zpicture.dcode.org/, http://mulan.dcode.org/
• All three tools allow detection of conserved TFBS from Transfac, Jaspar, and other databases

Conserved motifs are more likely to be functional…

• As long as the biology you are interested in is also conserved
  – Important to consider the appropriate species for comparisons
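To make "conserved at the DNA sequence alignment level" concrete, here is a minimal Python sketch that scans a pairwise alignment with a sliding window and reports windows above an identity threshold. It is an illustration only, not the algorithm used by the ECR browser, zPicture or Mulan; the window size, step and 70% threshold are assumed values, and the function names are hypothetical.

```python
def window_identity(aln_a, aln_b, window=100, step=10):
    """Yield (start, fraction_identical) for windows over two aligned
    sequences of equal length. Gap characters ('-') never count as matches."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    for start in range(0, len(aln_a) - window + 1, step):
        a = aln_a[start:start + window]
        b = aln_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        yield start, matches / window


def conserved_windows(aln_a, aln_b, window=100, step=10, min_identity=0.7):
    """Return the windows whose identity is at or above `min_identity`."""
    return [
        (start, identity)
        for start, identity in window_identity(aln_a, aln_b, window, step)
        if identity >= min_identity
    ]


# Toy alignment: first 10 columns identical, last 10 columns gapped.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGTAC----------"
print(conserved_windows(seq_a, seq_b, window=10, step=10))
```

Which species pair goes into the alignment matters as much as the threshold: a window that is conserved between close relatives may simply not have had time to diverge.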


ECR details: Step 2

Summary of conserved TFBS


Spatial display of conserved TFBS

Focusing on accessible chromatin

• Even well conserved motifs cannot be accessed in closed regions of chromatin

Not accessible: e.g. H3K9me3, H3K27me3

Accessible: e.g. H3K27ac


How to find active elements? Chromatin immunoprecipitation with TF and histone-modification antibodies

• Chromatin and attendant proteins are chemically crosslinked (lightly) using formaldehyde
  – Crosslinking will also attach proteins to each other, so that detection of secondary chromatin interactions is inevitable
• Cross-linked chromatin is randomly sheared by sonication (average fragment size 200-500 bp)

• Sonicated fragments in solution are exposed to a protein-specific antibody

• Antibody is retrieved with DNA still attached

• DNA is released with salt and heat (reverses the crosslinks)
  – Library is created for sequencing: ligation of “tags” and light PCR amplification (ATGGCCTTAACGA…..)
  – Sequenced directly, e.g. Illumina sequencing

Sequence-based ChIP approaches…

• Harness ChIP, DNase sensitivity, and other assays to Illumina sequencing
  – ChIP-enriched DNA is ligated to Illumina linkers and sequenced directly
  – If your experiment works, you’ve enriched a very small fraction of the genome
  – Requires a lot of input chromatin! Traditional methods need ~10^7 cells per experiment!!
  – Critical step is an efficient, selective antibody (and very few exist)


ChIP computational issues

• Sequence is read from randomly positioned ends of multiple, overlapping, randomly sheared fragments
  – Reads will be scattered over a distance ~2X the shear fragment length
  – ChIP-seq reads surround but may not contain the DNA binding site
• Computational tools (like MACS) need to join adjacent sets of read peaks and define a “shift” distance between read peaks to determine a summit

[Figure: binding site, sequence reads, and ChIP fragments]
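The "shift" idea can be sketched for a single candidate site: forward-strand reads pile up on one side of the bound protein, reverse-strand reads on the other, and moving each cluster inward by half the distance between them brings both onto the summit. The Python below is a simplified illustration, not the model MACS actually fits; the function names and the toy coordinates are assumptions.

```python
from statistics import median


def strand_offset(fwd_5p, rev_5p):
    """Distance (bp) between the forward- and reverse-strand read clusters.
    Inputs are 5' end coordinates; for reverse-strand reads the 5' end is
    the rightmost base of the read."""
    return median(rev_5p) - median(fwd_5p)


def estimate_summit(fwd_5p, rev_5p):
    """Shift each read toward its 3' end by half the strand offset and
    return the median of the shifted positions as the summit estimate."""
    shift = strand_offset(fwd_5p, rev_5p) / 2
    shifted = [p + shift for p in fwd_5p] + [p - shift for p in rev_5p]
    return median(shifted)


# Toy read positions flanking a hypothetical binding site near position 1000.
forward = [900, 910, 920, 930, 940]
reverse = [1060, 1070, 1080, 1090, 1100]
print(strand_offset(forward, reverse))
print(estimate_summit(forward, reverse))
```

The offset is also a rough readout of the sheared fragment length, so a value far from the expected 200-500 bp is a hint that the library or the peak is suspect.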

Analytical considerations

• Genomic neighborhoods
  – Shear efficiency is not really “random”
    • Some genomic regions are fragile and sensitive; some are protected
    • Chromatin-matched, co-sheared controls are essential
    • Most peak finders work best when control and experimental samples have similar numbers of reads
• Repeatability is key
  – Biological, or at least technical, replicates are also essential
  – Artifactual peaks are very easy to generate!
  – Other ways to validate:
    • Known targets
    • Known motifs
    • Similar targets in different cell types or tissues
• Peak width
  – Transcription factors typically yield sharp peaks; chromatin marks are sometimes broader and more diffuse
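A first-pass repeatability check is to ask what fraction of peaks from one replicate overlap a peak in the other. The Python below is a minimal sketch under simple assumptions: peaks are `(chrom, start, end)` tuples with half-open coordinates, any shared base counts as an overlap, and the quadratic scan is only suitable for small peak lists. The function names and toy peaks are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def reproducible_peaks(rep1, rep2):
    """Peaks from rep1 that overlap at least one peak in rep2."""
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]


def fraction_reproducible(rep1, rep2):
    """Share of rep1 peaks that are also seen in rep2."""
    if not rep1:
        return 0.0
    return len(reproducible_peaks(rep1, rep2)) / len(rep1)


# Toy peak lists for two replicates.
replicate_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
replicate_2 = [("chr1", 150, 250), ("chr2", 300, 400)]
print(reproducible_peaks(replicate_1, replicate_2))
print(fraction_reproducible(replicate_1, replicate_2))
```

The same overlap test can be pointed at known targets or at peaks from a different cell type, which covers the other validation routes listed above.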


• User-friendly tools
  – MACS:
    • “Model-based” peak detection; is sensitive to peak enrichment and background
    • Zhang et al., Genome Biology 2008; Feng et al., Nat Protocols 2012, PMID: 22936215 (Xiaole Liu lab)
    • MACS1 is best for sharp peaks (TFs); will break diffuse peaks into smaller regions
    • MACS2 is designed to allow broad- or sharp-peak detection
  – HOMER (http://homer.salk.edu/homer)
    • Can be easily tweaked for more sensitive peak detection
    • Comes packaged with a rich set of peak annotation tools
    • Tools for DNase-seq, Hi-C, differential ChIP analysis and many more
  – Both tools permit generation of “wiggle files” or similar that can be viewed in the UCSC browser
    • Looking at your data is a very important step! Peak finders can miss peaks that you can easily see by eye!
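For readers who have not seen one, a wiggle file is a small text format. The Python sketch below bins read start positions over a region and writes them as a fixedStep wiggle track; it only illustrates what such a file contains and is not how MACS or HOMER produce their output. The function names, track name and read positions are assumptions.

```python
def bin_counts(read_starts, region_start, region_end, step):
    """Count 0-based read start positions in consecutive bins of width
    `step` covering the half-open region [region_start, region_end)."""
    n_bins = (region_end - region_start + step - 1) // step
    counts = [0] * n_bins
    for pos in read_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // step] += 1
    return counts


def fixedstep_wig(chrom, region_start, counts, step, name="ChIP"):
    """Format binned counts as a fixedStep wiggle track (text)."""
    header = f'track type=wiggle_0 name="{name}"'
    # Wiggle coordinates are 1-based, so add 1 to the 0-based region start.
    declaration = (
        f"fixedStep chrom={chrom} start={region_start + 1} "
        f"step={step} span={step}"
    )
    return "\n".join([header, declaration] + [str(c) for c in counts]) + "\n"


# Toy read start positions in a 100 bp region, binned at 25 bp.
counts = bin_counts([0, 5, 24, 25, 99, 100], 0, 100, 25)
print(fixedstep_wig("chr15", 0, counts, 25, name="toy_ChIP"))
```

Loading a track like this as a custom track next to the called peaks is the quickest way to spot peaks the caller split, merged or missed.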

Differential ChIP and connection to differential expression

• Just like differential sequence analysis
  – comparison requires rigorous normalization
• Normalization is complicated for ChIP
  – Peak height? Peak shape? Summit position? Read density? Local neighborhoods?
  – Not as simple as an intensity score or a yes/no count
• Chromatin dynamics and expression dynamics
  – *might* or *might not* be temporally coordinated

[Figure: UCSC browser view of mm9 chr15:76,304,000-76,313,000 (Bop1, Hsf1), showing frontal cortex control and experimental ChIP tracks for H3K4me3, H3K27ac, H3K4me1 and H3K27me3 alongside ENCODE/LICR cortex (8w) histone-modification signal and peak tracks, spliced ESTs and GenBank mRNAs]
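The simplest form of normalization scales read counts in a region by sequencing depth before comparing samples. A minimal Python sketch, assuming reads-per-million scaling and a pseudocount of 1 to avoid division by zero (both are illustrative choices, and the function names and counts are hypothetical). As the bullets note, this ignores peak shape, summit position and local neighborhood, so it is a starting point only.

```python
import math


def reads_per_million(count, total_mapped_reads):
    """Scale a raw read count to reads per million mapped reads."""
    return count * 1_000_000 / total_mapped_reads


def log2_fold_change(exp_count, exp_total, ctrl_count, ctrl_total,
                     pseudocount=1.0):
    """log2 ratio of depth-normalized read density in one region,
    experimental over control. The pseudocount is added to raw counts."""
    exp_rpm = reads_per_million(exp_count + pseudocount, exp_total)
    ctrl_rpm = reads_per_million(ctrl_count + pseudocount, ctrl_total)
    return math.log2(exp_rpm / ctrl_rpm)


# Same raw count, but the experimental library is twice as deep:
print(log2_fold_change(99, 20_000_000, 99, 10_000_000))
# Equal depth, twice the reads in the experimental sample:
print(log2_fold_change(199, 20_000_000, 99, 20_000_000))
```

The first call shows why raw counts mislead: identical counts become a twofold decrease once depth is taken into account.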


Data from ChIP with TFs, modified histones, and other proteins are available for human (and, to some degree, mouse and flies) as tables in the UCSC genome browser (www.genome.ucsc.edu)

From Hoffman et al., Nucleic Acids Res 41:827, 2013

Genome Biology Topic overview

• Lectures
  – Ross Hardison
    • Basics of gene regulation, epigenetics and ENCODE results
  – David Hawkins
    • Chromatin states, biological applications
  – James Taylor
    • Higher dimension chromatin structure
  – Lisa Stubbs
    • Integrating data for biological inference: basics of expression correlation methods
• Workshops
  – Bowtie and MACS on Galaxy (this morning)
  – Peaks to features in Galaxy (afternoon)
  – Bowtie and MACS / Tophat->Cuffdiff on the command line (this evening)
  – Monday: students’ choice
    • “How to” for ECR browser and zPicture (sequence alignments and conserved motifs)
    • Simple methods for expression correlation: Cluster and Cytoscape
    • ChIP peaks to MEME-ChIP (online connection to the MEME suite for large peak sets)
    • DAVID functional clustering analysis (GO and pathway analysis tools online)
