Gene regulaon: ENCODE project and the Human Epigenome Atlas

Joshua W. K. Ho, PhD

Head, Bioinformacs and Systems Medicine Laboratory Victor Chang Cardiac Research Instute

NHMRC/Naonal Heart Foundaon Career Development Fellow

Senior Lecturer (Conjoint), St Vincent’s Clinical School, UNSW

hp://bioinformacs.victorchang.edu.au

[email protected]

UNSW, 20th Apr 2017 Human genome has 23 pairs of chromosomes

Human somatic cells (ie, not gametes) have a diploid genome – one haploid genome from each parent

Each chromosome is a double helix

A haploid genome contains ~ 20,000 -coding genes

22 pairs of autosomes 1 pair of sex chromosomes

Complete* human genome sequenced by 2003 2 Beyond DNA sequence: chromatin organisation is important for epigenetic regulation

One nucleosome

Nucleosome core = 146bp of DNA + histone octamer

Linker DNA = ~ 10-80 bp in length

Density nucleosome is an important determinant of chroman accessibility 3

Transcription factor (TF), RNA polymerase, histone modification, chromatin accessibility, and DNA methylation

Wang et al. (2016) Computaonal Biology & Bioinformacs: Gene Regulaon 4 Histone modification is associated with genomic regulatory elements

H3K4me3 = Trimethylaon of histone H3 at lysine 4

Promoters: H3K4me3 & some H3K4me1 & H3K27ac Enhancers: H3K4me1 & H3K27ac and some H3K4me3 Repressed: H3K27me3 or H3K9me3 Ho (2012) Biophysical Reviews 5 The ENCODE Project

The Encyclopedia of DNA Elements (ENCODE) Consorum is an internaonal collaboraon of research groups funded by the Naonal Human genome Research Instute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of funconal elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which gene is acve. hps://www.encodeproject.org 6

ENCODE – past and present

• Launched in September 2003 to identify all functional elements in the human genome sequence – as a follow-up to the Human Genome Project.

• 2003-2007: The pilot phase and technology development phase. Test and compare existing method to rigorously analyse a defined portion (1%) of the human genome sequence.

• 2007-now: genome-wide data production phase

• Major finding: showing that ~80% of the genome participate in at least one biochemical RNA- and/or chromatin- associated event in at least one cell type (ENCODE 2012, Nature)

7 Who are the ENCODE team? USA UK

Spain

Israel

hps://www.encodeproject.org/about/contributors/ List of current ENCODE research groups: hp://www.genome.gov/265252208

ENCODE summary slides and videos hps://www.genome.gov/27561910 hps://www.encodeproject.org/tutorials/

9 Roadmap Epigenomics

Defining the human epigenome

hp://www.roadmapepigenomics.org/

10 GTEx – Genotype-tissue expression

Discovering associaon between genotype (e.g., SNP) and gene expression in humans

hp://www.gtexportal.org/ 11

FANTOM – another international effort Discovery of funconal elements in H. sapiens (human) and M. musculus (mouse)

hp://fantom.gsc.riken.jp/12

Related project: mouse ENCODE ENCODE for M. musculus (mouse)

hp://www.mouseencode.org/13

Related project - modENCODE

ENCODE for D. melanogaster (fruit fly) and C. elegans (roundworm)

hp://www.modencode.org/ 14

What can we learn from ENCODE?

hps://www.encodeproject.org/ • Genome-wide data – RNA-seq, ChIP-seq, DNase-seq,… • Genome annotation – Gene annotation, enhancers, chromatin state,… • Experimental protocols and best practices • Bioinformatics analysis methods and software

Landt et al. (2012) Genome Research 15 ENCODE human cell lines • ENCODE data were mostly generated from human cell lines. Tier 1 cell lines are more widely assayed than tier 2 cell lines, and in turn are more widely assayed than tier 3 cell lines.

• Tier 1: GM12878 (lymphoblastoid cell line, from a female HapMap individual); H1-hESC (embryonic stem cells); K562 (leukemia cell line)

• Tier 2: 15 cell lines, including IMR90 (fetal lung fibroblasts), HUVEC (umbilical endothelial cells), HeLa-S3 (cervical carcinoma), MCF-7 (mammary gland), etc.

• Tier 3: 338 cell lines hps://genome.ucsc.edu/ENCODE/cellTypes.html 16 Experimental assays • DNA methylation – Methyl Array, RRBS

• Chromatin accessibility – DNase-seq, FAIRE-seq

• RNA binding – RIP tiling array, RIP-seq

• Chromatin modification – ChIP-seq

• TF binding – ChIP-seq

• Higher order chromatin structure – 5C, ChIA-PET hps://genome.ucsc.edu/ENCODE/dataMatrix/ • Replication encodeDataMatrixHuman.html – Repli-chip, Repli-seq 17 RNA-seq

18 RNA-seq reads in a browser

RNA-seq reads

Gene annotaon Applicaons: 1) Discovery of novel transcript or alternave promoter 2) Look at alternave splicing 3) Profiling of gene expression 19 ChIP-seq maps genome-wide DNA-protein interactions • ChIP = Chromatin- immunoprecipitation

• Enrich for sequence fragments that are bound by a specific protein – Transcription factors – Chromatin- associated proteins – Histone proteins

Park (2009) Nature Rev. Genecs 20 ChIP-seq

Histone with a modified tail nucleosome

TF DNA

ChIP-seq profiles Transcripon factor

Histone modificaon

Bulk nucleosome

Ho et al. (2011) Tag-based Approaches for Next Generaon Sequencing 21 Identifying chromatin landscape in human cell lines

Ernst et al. (2011) Nature Chroman state = a combinaon of histone modificaons 22 Chromatin state can differ between organisms

Ho et al. (2014) Nature 23 Accessing ENCODE data from UCSC Genome Browser Species Ref. genome assembly Search for gene, or genomic interval

24 Many genomic ‘tracks’ in UCSC GB

Click “Refresh”

Select track

25 ENCODE data and annotation on UCSC GB

BLAT result

Gene annotaon

Chroman state track

ENCODE track’s naming format: cell type and experiment: e.g., K562 H3K4me3

26 You can upload custom tracks into UCSC Genome Browser

My custom track Custom track data (BED format in this example)

27 Using encodeproject.org

28 29 30 31 Download data

32 33 34 Visualise data

35 36 Use RegulomeDB to idenfy all SNPs around NKX2-5, and predict of they are potenal regulatory elements (promoter/enhancers) Try chr5:172,644,666-172,676,755 at hp://regulomedb.org/

37 38 39 40 FactorBook: The ENCODE knowledgebase of transcripon factors hp://www.factorbook.org/

41 42 43