Virtual ChIP-seq: Predicting Transcription Factor Binding
bioRxiv preprint doi: https://doi.org/10.1101/168419; this version posted February 28, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome

Mehran Karimzadeh^1,2,3 and Michael M. Hoffman^1,2,4,5

1 Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
2 Princess Margaret Cancer Centre, Toronto, ON, Canada
3 Vector Institute, Toronto, ON, Canada
4 Department of Computer Science, University of Toronto, Toronto, ON, Canada
5 Lead contact: michael.hoff[email protected]

February 28, 2018

Abstract

Motivation:
Identifying transcription factor binding sites is the first step in pinpointing non-coding mutations that disrupt the regulatory function of transcription factors and promote disease. ChIP-seq is the most common method for identifying binding sites, but performing it on patient samples is hampered by the amount of available biological material and the cost of the experiment. Existing methods for computational prediction of regulatory elements primarily predict binding in genomic regions with sequence similarity to known transcription factor sequence preferences. This has limited efficacy since most binding sites do not resemble known transcription factor sequence motifs, and many transcription factors are not even sequence-specific.

Results:
We developed Virtual ChIP-seq, which predicts binding of individual transcription factors in new cell types using an artificial neural network that integrates ChIP-seq results from other cell types and chromatin accessibility data in the new cell type. Virtual ChIP-seq also uses learned associations between gene expression and transcription factor binding at specific genomic regions.
This approach outperforms methods that use transcription factor sequence preferences in the form of position weight matrices, predicting binding for 34 transcription factors (accuracy > 0.99; Matthews correlation coefficient > 0.3). In at least one validation cell type, performance of Virtual ChIP-seq is higher than all participants of the DREAM Challenge for in vivo transcription factor binding site prediction in 4 of 9 transcription factors that we could compare to.

Availability:
The datasets we used for training and validation are available at https://virchip.hoffmanlab.org. We have deposited in Zenodo the current version of our software (http://doi.org/10.5281/zenodo.1066928), datasets (http://doi.org/10.5281/zenodo.823297), and the predictions for 34 transcription factors on Roadmap cell types (http://doi.org/10.5281/zenodo.1066932).

1 Introduction

Transcription factor (TF) binding regulates gene expression. Each TF can harmonize expression of many genes by binding to genomic regions that regulate transcription. Cellular machinery utilizes these master regulators to regulate key cellular processes and adapt to environmental stimuli. Alteration in sequence or quantity of a given TF can impact expression of many genes. In fact, these alterations can be the primary cause of hereditary disorders, complex disease, autoimmune defects, and cancer [1].

TFs bind to accessible chromatin based on weak non-covalent interactions between amino acid residues and nucleic acids. DNA's primary structure (sequence) [2], secondary structure (shape) [3], and tertiary structure (conformation) [4] all play roles in TF binding.
Many TFs form a complex with others as well as chromatin-binding proteins and therefore bind to DNA indirectly. Some TFs also have different isoforms and undergo various post-translational modifications. In vitro assays, such as high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) [5] and protein binding microarrays [6], have provided a compelling understanding of context-independent TF sequence and shape preference [7]. Yet, for the aforementioned reasons, performance of models trained on these in vitro data is poor when applied to in vivo experiments [8,9]. To address this challenge, we must explore how to better model DNA shape, TF-TF interactions, and context-dependent TF binding.

Chromatin immunoprecipitation and sequencing (ChIP-seq) [10] and similar methods, such as ChIP-exo [11] and ChIP-nexus [12], can map the presence of a given TF in the genome of a biological sample. To map TFs, these assays require a minimum of 1,000,000 to 100,000,000 cells, depending on properties of the TF itself and available antibodies. Such large numbers of cells are not often available from clinical samples. Therefore, it is impossible to systematically assess TF binding in most disease systems. Assessing chromatin accessibility through transposase-accessible chromatin using sequencing (ATAC-seq) [13], however, requires only hundreds or thousands of cells. One can obtain this many cells from many more clinical samples. While chromatin accessibility does not determine TF binding, several methods use this information together with knowledge of TF sequence preference, genomic conservation, and other genomic features to predict TF binding [14,15,16].

Predicting TF binding with motif discovery tools within chromatin accessible regions has helped us understand the role of several TFs in various diseases. For example, He et al. [17] used motif discovery tools to identify the role of OCT1 and NKX3-1 after prolonged androgen stimulation in prostate cancer. Similarly, Bailey et al. [18] discovered that a known breast cancer risk polymorphism is an ESR1 binding site in its wild-type context. This ESR1 binding site is also a hotspot of somatic non-coding mutations in its vicinity. We propose that using more accurate tools to predict TF binding will make it possible to understand the role of TF binding in more contexts.

Previous studies have used various approaches to predict TF binding. Several methods use unsupervised approaches such as hierarchical mixture models [14] or hidden Markov models [15] to identify transcription factor footprints using chromatin accessibility data. These approaches use sequence motif scores to attribute footprints to different transcription factors. Convolutional neural network models can boost precision by learning sequence preferences from in vivo, rather than in vitro, data [20,21]. Variation in sequence specificity and cooperative binding of some transcription factors prevents these methods from accurately predicting binding of all transcription factors. A more recent approach uses matrix completion to impute TF binding using a 3-mode tensor representing genomic positions, cell types, and TF binding [22]. This method doesn't rely on sequence specificity, but can only predict TF binding in well-studied cell types with many ChIP-seq datasets. This means one cannot use it to predict binding in a cell type where ChIP-seq is not possible, such as limited clinical samples.
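
The motif-score-based approaches described above rely on position weight matrices. As a minimal sketch of that general idea, and not the authors' method or any cited tool, the following Python code converts hypothetical base counts into log-odds scores and scans a sequence for the best-scoring window. The count values, the uniform background of 0.25, the pseudocount, and the function names `log_odds_matrix`, `window_scores`, and `best_window` are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores.

    Assumes a uniform background frequency; the pseudocount avoids log(0).
    """
    matrix = []
    for column in counts:
        total = sum(column[base] for base in BASES) + pseudocount * len(BASES)
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in BASES
        })
    return matrix


def window_scores(matrix, sequence):
    """Score every window of the motif's width (forward strand only).

    Assumes the sequence contains only uppercase A, C, G, and T.
    """
    width = len(matrix)
    return [
        sum(column[base]
            for column, base in zip(matrix, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


def best_window(matrix, sequence):
    """Return (start index, score) of the highest-scoring window."""
    scores = window_scores(matrix, sequence)
    if not scores:
        raise ValueError("sequence is shorter than the motif")
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]


# Hypothetical counts for a 4-bp motif whose consensus is TGAC.
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_matrix = log_odds_matrix(toy_counts)
start, score = best_window(toy_matrix, "AATGACGG")
print(start, round(score, 3))
```

In this toy input, the window starting at index 2 (TGAC) matches the consensus at every position and therefore scores highest. A realistic analysis would use curated matrices, scan both strands, and handle ambiguous bases, none of which this sketch attempts.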
[Figure 1: ChIP-seq peaks for a TF with any JASPAR sequence motif from the TF's family. Y-axis: "Peaks with sequence motif (%)"; x-axis: individual TFs. Boxplots show the distribution among datasets from different cell types and replicates. Horizontal line of boxplot: median. Box range: interquartile range (IQR). Whisker: most extreme value within quartile … Individual points: outliers beyond a whisker. TFs where less than 50% of peaks have the sequence motif (low motif occupancy, green), and TFs where 50% or more of peaks have the sequence motif (high motif occupancy, blue). For some TFs, such as ATF3, central enrichment of the motif is also low. In contrast, most MAFK peaks contain its sequence motif and show central enrichment.]

Identifying the best approach for predicting TF binding remains a challenge, because most