Computational analysis of the ENCODE datasets and other related epigenetic explorations

Ved Topkar

Harvard College class of 2016

Gunawardena Lab
Harvard Medical School Department of Systems Biology
13 August 2013

Ved Topkar (Harvard College), Epigenetic analysis at promoters, 13 August 2013

Introduction Presentation Goals

FULL understanding of discussed material
Ask questions along the way!

Introduction Outline

1. Molecular biology in a jiffy
2. A case study
   Hypothesis formulation
   Analyzing data
3. More examples

Molecular Biology Essentials The

Molecular Biology Essentials The Central Dogma

Molecular Biology Essentials Transcriptional Regulation

Molecular Biology Essentials Transcriptional Access

Background and Expression

Things beyond just the base pairs in DNA matter →

Background The Question

Analyze the ENCODE dataset

Background ENCODE (Overview)

Overview
National Human Genome Research Institute: Encyclopedia of DNA Elements (ENCODE)
Nearly 600 collaborating labs post Human Genome Project (HGP)

Background The Data Set

Background Data Types

Raw signals
Raw signal outputs (e.g. PeakSeq results)
Relatively coarse-grained peak data

A simple example TFBS The Game Plan

Can we reduce factor binding landscapes into categories?
Scan across genome, looking for promoters
Bin promoters appropriately
Score binding at each
Clustering analysis

A simple example TFBS RefSeq

Overview
Curated database of
New versions released as frequently as Firefox
Includes , haplotype variations, and predicted genes

A simple example TFBS Defining Promoter?

Only upstream from TSS?
Incredibly far regulatory regions?
Intronic regulation?
Post termination regulatory elements?
1000 bp upstream
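The "1000 bp upstream" choice can be sketched as a strand-aware window around the transcription start site (TSS). This is a minimal illustration, not the code used in the talk: the `Gene` record, its field names, and the 0-based half-open coordinate convention are assumptions made for the example.

```python
from typing import NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 0-based, half-open."""
    chrom: str
    tx_start: int  # leftmost transcribed base
    tx_end: int    # one past the rightmost transcribed base
    strand: str    # "+" (forward) or "-" (backward)


def promoter_window(gene: Gene, upstream: int = 1000) -> Tuple[str, int, int]:
    """Return (chrom, start, end) covering `upstream` bp before the TSS."""
    if gene.strand == "+":
        # Forward strand: the TSS is the leftmost base, upstream lies to the left.
        start, end = max(0, gene.tx_start - upstream), gene.tx_start
    elif gene.strand == "-":
        # Backward strand: the TSS is the rightmost base, upstream lies to the right.
        start, end = gene.tx_end, gene.tx_end + upstream
    else:
        raise ValueError(f"unknown strand: {gene.strand!r}")
    return gene.chrom, start, end


forward = promoter_window(Gene("chr1", 5000, 9000, "+"))
backward = promoter_window(Gene("chr1", 5000, 9000, "-"))
```

For the same gene coordinates, the two strands give windows on opposite sides of the gene, which is why strand has to be read from the annotation rather than assumed.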

A simple example TFBS Binning

How do we quantitatively analyze promoter presence?
Break promoter regions into bins for a finer metric?
Do we give weights to bins as a function of their position?
Single, unweighted 1000 bp bin of counts
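A single, unweighted bin of counts amounts to counting, for each promoter window, how many binding-site peaks of each factor overlap it. The sketch below assumes half-open `(chrom, start, end)` intervals; the function names and the toy peak coordinates are invented for illustration and do not come from the ENCODE data.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open


def count_overlapping_peaks(window: Interval, peaks: List[Interval]) -> int:
    """Count peaks sharing at least one base with the window."""
    chrom, start, end = window
    return sum(
        1
        for p_chrom, p_start, p_end in peaks
        if p_chrom == chrom and p_start < end and p_end > start
    )


def score_promoters(
    promoters: Dict[str, Interval],
    peaks_by_factor: Dict[str, List[Interval]],
) -> Dict[str, List[int]]:
    """Return one count vector per promoter, ordered by sorted factor name."""
    factors = sorted(peaks_by_factor)
    return {
        name: [count_overlapping_peaks(window, peaks_by_factor[f]) for f in factors]
        for name, window in promoters.items()
    }


# Toy data: two promoter windows and two factors with made-up peaks.
promoters = {"geneA": ("chr1", 4000, 5000), "geneB": ("chr1", 9000, 10000)}
peaks_by_factor = {
    "CTCF": [("chr1", 4900, 5100), ("chr1", 9500, 9600), ("chr2", 4000, 5000)],
    "MYC": [("chr1", 3000, 4000), ("chr1", 4100, 4200)],
}
scores = score_promoters(promoters, peaks_by_factor)
```

Each promoter becomes a vector with one entry per factor, which is the form a clustering step can consume directly.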

A simple example Survey of Promoters Forward vs. Backward Strand

A simple example Survey of Promoters TF Binding Frequency

A simple example Survey of Promoters Histogram of promoter binding

A simple example Survey of Promoters Promoter/TFBS intersections

A simple example Survey of Promoters Computational Efficiency

This was an exercise in program optimization
Original algorithm took about 5 days; optimized/parallelized algorithm took just a few hours
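The slides do not say which optimizations were applied, so the following is only one plausible example of the kind of speedup available, not the talk's actual method. Instead of scanning every peak for every promoter, peak starts and ends on one chromosome are sorted once and each window is answered with two binary searches. `PeakIndex` and its method names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List, Tuple


class PeakIndex:
    """Sorted index over (start, end) peaks on a single chromosome (half-open)."""

    def __init__(self, peaks: Iterable[Tuple[int, int]]) -> None:
        peak_list = list(peaks)
        self.starts: List[int] = sorted(s for s, _ in peak_list)
        self.ends: List[int] = sorted(e for _, e in peak_list)

    def count_overlaps(self, start: int, end: int) -> int:
        """Count peaks overlapping [start, end) in O(log n) time."""
        # Peaks that begin before the window ends ...
        begun = bisect_left(self.starts, end)
        # ... minus peaks that already finished before the window starts.
        finished = bisect_right(self.ends, start)
        return begun - finished


def naive_count(peaks: List[Tuple[int, int]], start: int, end: int) -> int:
    """Reference O(n) scan used to check the indexed version."""
    return sum(1 for s, e in peaks if s < end and e > start)


peaks = [(3000, 4000), (4100, 4200), (4900, 5100), (9500, 9600)]
index = PeakIndex(peaks)
fast = index.count_overlaps(4000, 5000)
slow = naive_count(peaks, 4000, 5000)
```

Because each chromosome's index is independent, building and querying the indices is also easy to spread across worker processes.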

A simple example Survey of Promoters Clustering

The unsupervised grouping of data such that elements within a group are similar to each other and dissimilar from elements in other groups

A simple example Clustering analysis Clustering

Minkowski Distance (p = 2 → Euclidean Distance)

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
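As a small worked illustration of the Minkowski distance, the function below computes it for two equal-length vectors. The function name and the example vectors are chosen here for illustration only.

```python
from typing import Sequence


def minkowski_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p; p = 2 is Euclidean, p = 1 is Manhattan."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid distance")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


euclidean = minkowski_distance([0, 0], [3, 4], p=2)
manhattan = minkowski_distance([0, 0], [3, 4], p=1)
```

For the 3-4-5 triangle used above, the Euclidean case gives 5 and the Manhattan case gives 7.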

A simple example Clustering analysis Clustering

Single Linkage Clustering

$$D(X, Y) = \min_{x \in X,\; y \in Y} d(x, y)$$
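The linkage rule above takes the distance between two clusters to be the distance between their closest pair of members. A minimal sketch, assuming Euclidean distance via the standard library's `math.dist` and made-up points on a line:

```python
import math
from itertools import product
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def single_linkage(
    X: Sequence[Point],
    Y: Sequence[Point],
    d: Callable[[Point, Point], float] = math.dist,
) -> float:
    """Cluster distance defined by the closest cross-cluster pair."""
    return min(d(x, y) for x, y in product(X, Y))


# Two toy clusters on a line; only the nearest pair, (1, 0) and (4, 0), matters.
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(4.0, 0.0), (10.0, 0.0)]
gap = single_linkage(X, Y)
```

Since only the nearest pair counts, a string of closely spaced points can pull distant groups into one cluster, which is the chaining behaviour this linkage is known for.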

A simple example Clustering analysis Clustering (single linkage exhibiting chaining)

A simple example Clustering analysis Clustering

Complete Linkage Clustering

$$D(X, Y) = \max_{x \in X,\; y \in Y} d(x, y)$$
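To show how a linkage rule drives the grouping, here is a deliberately naive agglomerative clustering loop using complete linkage. It is a teaching sketch with invented names and toy points, not the pipeline from the talk; it recomputes every cluster distance at each step and would be far too slow for genome-scale data.

```python
import math
from itertools import combinations, product
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def complete_linkage(X: Sequence[Point], Y: Sequence[Point]) -> float:
    """Cluster distance defined by the farthest cross-cluster pair."""
    return max(math.dist(x, y) for x, y in product(X, Y))


def agglomerate(
    points: Sequence[Point],
    n_clusters: int,
    linkage: Callable[[Sequence[Point], Sequence[Point]], float] = complete_linkage,
) -> List[List[Point]]:
    """Merge the two closest clusters until only n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters: List[List[Point]] = [[p] for p in points]
    while len(clusters) > n_clusters:
        # combinations() yields i < j, so deleting index j leaves index i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
groups = agglomerate(points, n_clusters=2)
```

Passing a different `linkage` function, such as one based on the minimum pairwise distance, changes the merge order without touching the loop.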

A simple example Clustering analysis Clustering (complete linkage)

A simple example Clustering analysis Computational Efficiency

Pairwise distance calculations require a LOT of RAM
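A quick back-of-the-envelope calculation shows why. Storing each unordered pair once requires n(n - 1)/2 values. The promoter count of 40,000 below is a hypothetical round number chosen for illustration, not a figure from the talk, and 8 bytes per value assumes double-precision floats.

```python
def condensed_distance_bytes(n_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store one distance per unordered pair of items."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_value


n_promoters = 40_000  # hypothetical count, for illustration only
total_bytes = condensed_distance_bytes(n_promoters)
print(f"{n_promoters} items -> {total_bytes / 1e9:.1f} GB of pairwise distances")
```

Under these assumptions the distances alone take about 6.4 GB, and doubling the number of items roughly quadruples that.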

A simple example Clustering analysis 2-way clustering heatmap

A simple example Clustering analysis 2-way clustering heatmap (with random data arrays)

A simple example Roadmap Dataset Analysis Modifications (Roadmap Dataset)

A simple example Roadmap Dataset Analysis A quick correlation test

R = 0.042
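The slide does not state which correlation coefficient R is or which two quantities were compared. Assuming it is a Pearson correlation between two per-promoter measurements, it could be computed as in the sketch below; the function name and sample values are illustrative only.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


perfect = pearson_r([1, 2, 3], [2, 4, 6])
uncorrelated = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])
```

A value as close to zero as 0.042 would indicate essentially no linear relationship between the two quantities, although a non-linear relationship could still exist.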

A simple example Roadmap Dataset Analysis Conclusions

There are numerous methods of promoter binning that prove useful for complexity reduction
Unsupervised clustering of preprocessed promoter data yields results with biological significance

A simple example Roadmap Dataset Analysis Next Steps

Incorporation of nucleosome enrichment, methylation, and histone modification data in a more meaningful way
Further refining of cluster analysis pipeline to uncover more unknown biology

A simple example End Thanks!

Jeremy Gunawardena and the HMS Department of Systems Biology!
My collaborator Tobias Ahsendorf!
PRISE and Greg Llacer!
The lovely PRISE staff!
This audience!