Overview of ENCODE Elements & Annotation

M Gerstein, on behalf of ...

Slides freely downloadable from Lectures.GersteinLab.org & "tweetable" (via @markgerstein). See last slide for more info & references.

1 - Lectures.GersteinLab.org How might we annotate a human text? Color is Function, Lines are Similarity [B Hayes, Am. Sci. (Jul.-Aug. '06)]

[Figure: annotated page from Brian Hayes's "Computing Science" column, "The Semicolon Wars", American Scientist, July-August 2006.]

2 - Lectures.GersteinLab.org Non-coding Annotations: Overview

There are several collections of information "tracks" related to non-coding features:

• Sequence features, incl. Conservation
• Functional Genomics: ChIP-seq (Epigenome & seq. specific TF) and ncRNA & un-annotated transcription

[Nat. Rev. Genet. (2010) 11:559]

3 - Lectures.GersteinLab.org Signal Track for RNA-seq & ChIP-seq [PLOS CB 4:e1000158]

4 - Lectures.GersteinLab.org Functional Genomics Annotations

A) PEAKS
1. DNase peaks at the UCSC genome browser {on many cell lines}
2. The regulation track at the UCSC genome browser, with compilation of TF ChIP-seq peaks from uniform processing (individual peaks are annotated with TF and cell line)
3. Blacklist Regions

B) PROMOTERS
GENCODE Annotated TSSes (also, TSSes with FANTOM CAGE support)

C) ENHANCERS (Supervised)

D) UNSUPERVISED SEGMENTATIONS, INCLUDING ENHANCERS
ChromHMM, SegWay, HiHMM ....

E) HOT/LOT REGIONS

F) CONNECTIVITY
1. Enhancer-target gene connection
2. TF-target connectivity network
3. TADs: Topologically Associated Domain

G) MOTIFS for TF binding

H) RNA
1. A matrix of expression data of known genes (or exons) for protein-coding genes & known ncRNAs {on many cell lines}
2. Novel RNA contigs, i.e., Transcriptionally Active Regions (novel transcripts or TARs)
3. Novel junctions track

I) OTHER
1. List of Allelic Regions & SNPs
2. Models

5 - Lectures.GersteinLab.org ChIP-seq: Higher level Information from Peaks, Aggregations & Networks

[Figure: genome browser view of K562 tracks at chr1:925,000-950,000 (10 kb scale bar): RefSeq Genes; Pol2 and c-Myc peaks (Pk) and signal (Sig); mIgG control signal; H3K4me3 and H3K36me3 signal. Track groups: TFs with Control; His. Marks (broad).]

[Figure: aggregation of H3K4me2 signal (active marks in EE) around the TSS and TES for autosomal transcripts, top 20% vs. bottom 20% by transcript level.]

[ENCODE Data + Sources: TFs & Control: Yale; HMs: UW & Broad] [Science 330: 1775]

6 Data Flow: peaks to proximal & distal networks

• Peak Calling
• Assigning TF binding sites to targets
• Filtering high confidence edges & distal regulation

Based on stat. model combining signal strength & location relative to typical binding.

[Figure labels: TF Proximal Edge; TF Potential Distal Edge; ~500K Edges; Strong Edges.]

[Cheng et al., Bioinfo. ('11); Yip et al., GenomeBiology ('12); Nature 489:91 ('12), doi:10.1038/nature11245]
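The "assigning TF binding sites to targets" step can be illustrated with a minimal sketch. It is not the statistical model cited on the slide: it simply labels a peak-to-gene pair as a proximal edge when the peak summit lies close to the gene's TSS, and as a potential distal edge when it lies farther away but within a larger window. The names `Peak` and `assign_edges`, the window sizes, and the score cutoff are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Peak:
    tf: str        # transcription factor that was ChIP-ed
    chrom: str     # chromosome name, e.g. "chr1"
    summit: int    # summit position in bp
    score: float   # peak strength (any monotone score)


Edge = Tuple[str, str]  # (tf, gene)


def assign_edges(
    peaks: List[Peak],
    tss_by_gene: Dict[str, Tuple[str, int]],
    proximal_window: int = 1_000,    # assumed cutoff, not from the slides
    distal_window: int = 100_000,    # assumed cutoff, not from the slides
    min_score: float = 0.0,          # assumed filter on peak strength
) -> Tuple[List[Edge], List[Edge]]:
    """Split TF-gene pairs into proximal and potential distal edges."""
    proximal: List[Edge] = []
    distal: List[Edge] = []
    for peak in peaks:
        if peak.score < min_score:
            continue  # drop weak peaks before assigning targets
        for gene, (chrom, tss) in tss_by_gene.items():
            if chrom != peak.chrom:
                continue
            distance = abs(peak.summit - tss)
            if distance <= proximal_window:
                proximal.append((peak.tf, gene))
            elif distance <= distal_window:
                distal.append((peak.tf, gene))
    return proximal, distal
```

A real pipeline would index TSSs by position rather than scanning every gene for every peak, and would weight edges by signal strength instead of applying hard windows.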

7 - Lectures.GersteinLab.org DNase Peaks & Open Chromatin

DNase hypersensitivity as a mark of functionality

[Figure labels: Peak; Transcription; ChIP-Seq.]

Groop L. Nature Genetics, 2010, 42(3): 190-192. Thurman et al., Nature 2012.

H3K27ac is an important mechanism to regulate the activity of enhancers in different developmental stages.

Epigenetically, H3K27ac marks are present near active enhancers. Nord, et al., Cell, 2013

Genome Segmentations

Unsupervised segmentation of chromatin features groups regions with similar patterns and labels each pattern, thus, annotating the genome.

Hoffman, et al., Nature Methods, 2012. Ernst & Kellis, Nature Methods, 2012.

RNA-seq: Higher level Information from Avg. signal at exons & "TARs" (RPKMs) [PNAS 4:107:5254 ; IJC 123:569]

• Mapped reads show isoform composition in two different stages. Signal tracks for two genes are shown. Figure made using UCSC genome browser. Du et al (in revision).

Mertz, Kirsten D., Francesca Demichelis, Andrea Sboner, Michelle S. Hirsch, Paola Dal Cin, Kirsten Struckmann, Martina Storz, et al. "Association of cytokeratin 7 and 19 expression with genomic stability and favorable prognosis in clear cell renal cell cancer." International Journal of Cancer 123, no. 3 (2008): 569-576.

From RNA-Seq experiments one would keep: the expression levels (i.e. RPKMs) of ~20,000 genes, ~70,000 transcripts, ~200,000 exons, and ~2 million splices. Also, we would store information on ~15,000 novel Transcriptionally Active Regions (TARs), ~2,000 allele-specific expressed genes, ~50 chimeric transcripts as well as lists of elements differentially expressed between conditions.
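Since the expression levels above are stored as RPKMs, a short sketch of the standard definition may help: reads per kilobase of feature per million mapped reads. The function name `rpkm` and the example numbers are illustrative only.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of feature per Million mapped reads.

    RPKM = read_count / (feature_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Example: 500 reads on a 2,000 bp exon in a library of 10 million mapped reads.
example = rpkm(500, 2_000, 10_000_000)
```

The same formula applies to genes, transcripts, exons, or TARs; only the feature length and the read count change.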

12 "Simplified" Annotation over ENCODE2 cell lines (~60 cell lines in total)

Simplified: "Slice" through the ENCODE, providing the subset of close-to-data annotations

• Gene expression matrix
• TSS list
• "Tissue type" facet for the cell lines (DCC)

GENCODE v19

Comprehensive (published annotation, mostly in '12 & '14 rollouts)

14 - Lectures.GersteinLab.org Simplified regulatory sites

• Candidate enhancers: The master list of TSS-distal DHS peaks annotated with TF ChIP-seq peaks across cell types.
• H3K27ac enrichment (percentile over background) in a cell-type-specific manner.
• Candidate promoters: The master list of TSS-proximal DHS peaks annotated with TF ChIP-seq peaks across cell types.

[Figure: UCSC Genome Browser on Human Feb. 2009 (GRCh37/hg19) Assembly, chr9:73,193,573-73,210,350 (16,778 bp), with custom tracks: distal DHS, proximal DHS, distal H3K27ac, distal TF, proximal TF.]

Access candidate genomic annotations via encodeproject.org on the "Data" menu bar:

encodeproject.org/data/annotations
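The slides do not say how the H3K27ac "percentile over background" is computed. One plausible reading, sketched here purely as an assumption, is an empirical percentile: the share of background-region signal values that are less than or equal to the signal at the candidate region. The function name and the toy numbers are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_over_background(value: float, background: Sequence[float]) -> float:
    """Percent of background signal values that are <= the given value."""
    if not background:
        raise ValueError("background must contain at least one value")
    ordered = sorted(background)
    # bisect_right counts how many sorted background values are <= value.
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Toy example: ten background regions with signal 1..10, candidate signal 7.5.
toy_background = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_percentile = percentile_over_background(7.5, toy_background)
```

Computing this separately per cell type, with that cell type's own background, would give the cell-type-specific behavior the slide describes.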


18 - Lectures.GersteinLab.org Details of DNase peaks, H3K27ac annotation and TF ChIP-seq annotations

• DNase peak detail
• H3K27ac annotation
• TF annotation

Peak Calling

ChIP

• Generate and threshold the signal profile and identify candidate target regions
  – Simulation (PeakSeq),
  – Local window based Poisson (MACS),
  – Fold change statistics (SPP)
• Score against the control

[Figure: the ChIP signal profile is thresholded to give Potential Targets, which are scored against the Normalized Control to give Significantly Enriched targets.]

Chromatin is the complex combination of DNA and proteins that make up the contents of the nucleus of a cell. The basic repeat element of chromatin is the nucleosome, interconnected by sections of linker DNA.*

* Bremner Lab website / http://www.integratedhealthcare.eu/1/en/histones_and_chromatin
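The two peak-calling steps listed above (threshold the signal profile to get candidate regions, then score candidates against the normalized control) can be sketched with a simple per-window Poisson test. This is only in the spirit of the local-window Poisson idea the slide attributes to MACS; it is not the actual algorithm of MACS, PeakSeq, or SPP. The window counts, the `threshold` and `alpha` values, and both function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(
    chip: Sequence[int],       # ChIP read counts per fixed-size window
    control: Sequence[int],    # control read counts for the same windows
    threshold: int = 10,       # assumed candidate cutoff on ChIP counts
    alpha: float = 1e-5,       # assumed significance cutoff
) -> List[int]:
    """Return indices of windows significantly enriched over the control."""
    if len(chip) != len(control):
        raise ValueError("chip and control must cover the same windows")
    # Normalize the control to the ChIP library size.
    scale = sum(chip) / max(sum(control), 1)
    # Genome-wide background rate, used as a floor for the local rate.
    background = sum(chip) / len(chip)
    enriched = []
    for index, (chip_count, control_count) in enumerate(zip(chip, control)):
        if chip_count < threshold:
            continue           # step 1: keep only candidate windows
        local_rate = max(control_count * scale, background)
        if poisson_sf(chip_count, local_rate) < alpha:
            enriched.append(index)   # step 2: significant vs. control
    return enriched
```

In this toy setup a window with high ChIP signal but equally high control signal is kept as a candidate and then rejected in the scoring step, which is the point of scoring against the control.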

21 - Lectures.GersteinLab.org