Virtual Chip-Seq: Predicting Transcription Factor Binding
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1 Virtual ChIP-seq: predicting transcription factor binding 2 by learning from the transcriptome 1,2,3 1,2,3,4,5 3 Mehran Karimzadeh and Michael M. Hoffman 1 4 Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada 2 5 Princess Margaret Cancer Centre, Toronto, ON, Canada 3 6 Vector Institute, Toronto, ON, Canada 4 7 Department of Computer Science, University of Toronto, Toronto, ON, Canada 5 8 Lead contact: michael.hoff[email protected] 9 March 8, 2019 10 Abstract 11 Motivation: 12 Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations 13 that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is 14 the most common method for identifying binding sites, but performing it on patient samples is 15 hampered by the amount of available biological material and the cost of the experiment. Existing 16 methods for computational prediction of regulatory elements primarily predict binding in genomic 17 regions with sequence similarity to known transcription factor sequence preferences. This has limited 18 efficacy since most binding sites do not resemble known transcription factor sequence motifs, and 19 many transcription factors are not even sequence-specific. 20 Results: 21 We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new 22 cell types using an artificial neural network that integrates ChIP-seq results from other cell types 23 and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associ- 24 ations between gene expression and transcription factor binding at specific genomic regions. This 25 approach outperforms methods that predict TF binding solely based on sequence preference, pre- 26 dicting binding for 36 transcription factors (Matthews correlation coefficient > 0:3). 27 Availability: 28 The datasets we used for training and validation are available at https://virchip.hoffmanlab. 29 org. We have deposited in Zenodo the current version of our software (http://doi.org/10. 30 5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), predictions for 36 31 transcription factors on Roadmap Epigenomics cell types (http://doi.org/10.5281/zenodo. 32 1455759), and predictions in Cistrome as well as ENCODE-DREAM in vivo TF Binding Site 33 Prediction Challenge (http://doi.org/10.5281/zenodo.1209308). 1 bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted March 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 34 1 Introduction 35 Transcription factor (TF) binding regulates gene expression. Each TF can harmonize expression of 36 many genes by binding to genomic regions that regulate transcription. Cellular machinery utilizes 37 these master regulators to regulate key cellular processes and adapt to environmental stimuli. 38 Alteration in sequence or quantity of a given TF can impact expression of many genes. In fact, 39 these alterations can be the primary cause of hereditary disorders, complex disease, autoimmune 1 40 defects, and cancer . 41 TFs bind to accessible chromatin based on weak non-covalent interactions between amino acid 2 3 42 residues and nucleic acids. DNA's primary structure (sequence) , secondary structure (shape) , and 4 43 tertiary structure (conformation) all play roles in TF binding. Many TFs form a complex with 44 others as well as chromatin-binding proteins and therefore bind to DNA indirectly. Some TFs also 45 have different isoforms and undergo various post-translational modifications. In vitro assays, such 5 46 as high throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) and 6 47 protein binding microarrays , have provided a compelling understanding of context-independent 7 48 TF sequence and shape preference . Yet, for the aforementioned reasons, performance of models 8,9 49 trained on these in vitro data are poor when applied on in vivo experiments . To address this 50 challenge, we must explore how to better model DNA shape, TF-TF interactions, and context- 51 dependent TF binding. 10 52 Chromatin immunoprecipitation and sequencing (ChIP-seq) and similar methods, such as 11 12 53 ChIP-exo and ChIP-nexus , can map the presence of a given TF in the genome of a biological 54 sample. To map TFs, these assays require a minimum of 1,000,000 to 100,000,000 cells, depending 55 on properties of the TF itself and available antibodies. Such large numbers of cells are not often 56 available from clinical samples. Therefore, it is impossible to systematically assess TF binding in 57 most disease systems. Assessing chromatin accessibility through transposase-accessible chromatin 13 58 using sequencing (ATAC-seq) , however, requires only hundreds or thousands of cells. One can 59 obtain this many cells from many more clinical samples. While chromatin accessibility does not de- 60 termine TF binding, several methods use this information together with knowledge of TF sequence 14,15,16 61 preference, genomic conservation, and other genomic features to predict TF binding . 62 Predicting TF binding with motif discovery tools within chromatin accessible regions has helped 17 63 us understand the role of several TFs in various disease. For example, He et al. used motif discov- 64 ery tools to identify the role of OCT1 and NKX3-1 after prolonged androgen stimulation in prostate 18 65 cancer. Similarly, Bailey et al. discovered that a known breast cancer risk single nucleotide poly- 66 morphism (SNP) upstream of ESR1 disrupts GATA3 binding and enhances expression of ESR1. 67 We propose that using more accurate tools to predict TF binding will allow understanding the role 68 of TF binding in more contexts. 69 Previous studies have used various approaches to predict TF binding. Several methods use unsu- 14 15 70 pervised approaches such as hierarchical mixture models or hidden Markov models to identify 71 transcription factor footprint using chromatin accessibility data. These approaches use sequence 72 motif scores to attribute footprints to different transcription factors. Convolutional neural network 73 models can boost precision by learning sequence preferences from in vivo, rather than in vitro 20,21 74 data . Variation in sequence specificity and cooperative binding of some transcription factors 75 prevents these methods from accurately predicting binding of all transcription factors. A more re- 76 cent approach uses matrix completion to impute TF binding using a 3-mode tensor representing 22 77 genomic positions, cell types, and TF binding . This method doesn't rely on sequence specificity, 78 but can only predict TF binding in well-studied cell types with many ChIP-seq datasets. This 79 means one cannot use it to predict binding in a cell type where ChIP-seq is not possible, such as 80 limited clinical samples. 81 Identifying the best approach for predicting TF binding remains a challenge, because most 2 a100 75 50 25 0 RB1 TBP PML MAZ SIX5 TAF1 ATF1 TAF7 ATF7 BMI1 MXI1 ZZZ3 E2F1 MTA3 MTA1 EZH2 ZHX1 PHF8 HSF1 ZHX2 BRF1 BRF2 RFX5 PBX3 UBTF RNF2 CBX2 CBX3 CBX8 CBX5 BDP1 CHD1 CHD7 CHD4 CHD2 IKZF1 SMC3 MBD4 SIRT6 ZMIZ1 SIN3B SIN3A CREM − NONO ZFP36 EP300 KAT2B KAT2A SUZ12 H3F3A SAP30 TFDP1 FOXP2 NELFE ZBED1 CTBP2 GTF2B RAD21 NR3C1 NR1H2 RBBP5 MYBL2 HDAC2 HDAC6 HDAC1 HCFC1 CCNT2 BRCA1 CREB1 CEBPZ KDM4A KDM5A KDM5B CEBPD KDM1A SMAD2 RCOR1 NCOR1 TRIM22 TRIM28 HMGN3 WHSC1 ZNF143 ARID3A NANOG NFATC1 GTF2F1 GTF3C2 WRNIP1 SREBF2 SETDB1 TARDBP POLR3A POLR2A POLR3G CREBBP TBL1XR1 HA ZC3H11A CREB3L1 SUPT20H ZKSCAN1 NEUROD1 100 75 Peaks with sequence motif (%) Peaks 50 bioRxiv preprint 25 low. In contrast, most MAFKmatching peaks sequence both motif contain ofcompared its to the sequence TFs motif same with and TF, motif(c) show occupancy such central of as enrichment. 50% ATF3, orgreen), more. central and enrichment TFs where of 50%JASPAR the or (red), more motif TFs of is where peaksIndividual also less have the points: than sequence 50% outliers motif of (high beyonddian. motif peaks Box a occupancy, range: have blue). whisker. interquartile the range sequencedistribution (IQR). among motif Whisker: datasets most (low from extreme motif different valueChIP-seq occupancy, cell within peaks types quartile for and a TF replicates.Figure with Horizontal any line 1: JASPAR of sequence boxplot: motif from me- the TF's family. Boxplots show the 0 Central enrichment SP2 SP1 SP4 YY1 JUN SRF FOS MNT IRF4 IRF3 MYB MAX IRF2 SPI1 MYC TAL1 ATF3 ATF2 NFIC E2F4 ELF1 E2F6 KLF5 BATF PAX5 PAX8 ELK1 ELK4 TCF3 TCF7 EBF1 ZEB1 ETV6 ETS1 BCL3 USF1 NFE2 USF2 RFX1 NFYA PBX2 SOX6 NRF1 RELA JUND CTCF NFYB REST CUX1 MAFF EGR1 RXRA MAFK STAT1 STAT3 GFI1B MEIS2 GATA2 GATA1 GATA3 TCF12 NR2F2 FOSL2 FOSL1 FOXA1 FOXA2 CTCFL TEAD4 THAP1 HNF4A NR2C2 BACH1 GABPA MEF2A RUNX3 RUNX1 FOXM1 HNF4G MEF2C CEBPB SMAD1 ESRRA STAT5A ZNF384 ZNF217 ZNF274 ZNF263 TCF7L2 ZNF207 NFE2L2 ZBTB33 BCL11A ZBTB7A BCLAF1 POU5F1 POU2F2 SREBF1 BHLHE40 SMARCA4 SMARCB1 SMARCC1 doi: Most ChIP-seq peaks lack the TF's sequence motif. (a) Chromatin factors with ChIP-seq data in ENCODE dataset certified bypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. https://doi.org/10.1101/168419 Motif occupancy No sequence motif 0% < Motif occupancy < 50% Motif occupancy ≥ 50% JASPAR MA0496.1 MAFK b d 0.04 Peaks with motif = 51569 2.0 No sequence motif 1.5 1.0 Bits 0.5 0.0 0% < Motif occupancy < 50% 19 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0.03 of a TF's motif is lower for TFs with motif occupancy of less than 50% JASPAR MA0605.1 Atf3 Peaks with motif = 880 2.0 1.5 Motif occupancy ≥ 50% 1.0 Bits 0.5 0.0 0 20 40 60 80 1 2 3 4 5 6 7 8 Number of chromatin factors 0.02 ; this versionpostedMarch12,2019.