Improving Multitask Transcription Factor Binding Site Prediction with Base-Pair Resolution
bioRxiv preprint doi: https://doi.org/10.1101/2021.05.29.446316; this version posted May 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

NetTIME: Improving Multitask Transcription Factor Binding Site Prediction with Base-pair Resolution

Ren Yi (a), Kyunghyun Cho (a,b,*), Richard Bonneau (a,b,c,d,*)

(a) Department of Computer Science, New York University, New York, NY 10011, USA
(b) Center for Data Science, New York University, New York, NY 10011, USA
(c) Department of Biology, New York University, New York, NY 10003, USA
(d) Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY 10010, USA

Abstract

Motivation: Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here we propose NetTIME, a multitask learning framework for predicting cell-type-specific transcription factor binding sites with base-pair resolution.

Results: We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities.
We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method's predictive performance with several state-of-the-art methods, including DeepBind, BindSpace, and Catchitt, and show that our method outperforms previous methods under both supervised and transfer learning settings.

Availability: NetTIME is freely available at https://github.com/ryi06/NetTIME

Contact: [email protected]

*To whom correspondence should be addressed.

1. Introduction

Genome-wide modeling of non-coding DNA sequence function is among the most fundamental and yet challenging tasks in biology. Transcriptional regulation is orchestrated by transcription factors (TFs), whose binding to DNA initiates a series of signaling cascades that ultimately determine both the rate of transcription of their target genes and the underlying DNA functions. Both the cell-type-specific chromatin state and the DNA sequence affect the interactions between TFs and DNA in vivo [1]. Experimentally determining cell-type-specific TF binding sites is made possible through high-throughput techniques such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) [2]. Due to experimental limitations, however, it is infeasible to perform ChIP-seq (or related single-TF-focused experiments) on all TFs across all cell types and organisms [1]. Therefore, computational methods for accurately predicting in vivo TF binding sites are essential for studying TF functions, especially for less well-known TFs and cell types.
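The abstract's point that a linear-chain CRF removes the need for a probability threshold can be illustrated with a small decoding example. The sketch below is not NetTIME's trained CRF: it runs Viterbi decoding on a two-state chain (0 = unbound, 1 = bound) whose transition scores are set by hand through a hypothetical `stay_bonus` parameter, and the input probabilities are made up for illustration.

```python
import math

def viterbi_binary(probs, stay_bonus=2.0):
    """Most likely 0/1 label path for a two-state linear chain.

    probs      : per-position binding probabilities (illustrative values)
    stay_bonus : hypothetical score added when neighbouring labels agree
    """
    eps = 1e-6
    probs = [min(max(p, eps), 1.0 - eps) for p in probs]
    # emit[t][s]: log-score of label s (0 = unbound, 1 = bound) at position t.
    emit = [(math.log(1.0 - p), math.log(p)) for p in probs]
    # trans[i][j]: score for moving from label i to label j.
    trans = ((stay_bonus, 0.0), (0.0, stay_bonus))

    score = list(emit[0])
    back = []  # back[t - 1][j]: best label at t - 1 given label j at t
    for t in range(1, len(probs)):
        new_score, pointers = [], []
        for j in (0, 1):
            cands = [score[i] + trans[i][j] for i in (0, 1)]
            best_i = 0 if cands[0] >= cands[1] else 1
            pointers.append(best_i)
            new_score.append(cands[best_i] + emit[t][j])
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final label.
    labels = [0 if score[0] >= score[1] else 1]
    for pointers in reversed(back):
        labels.append(pointers[labels[-1]])
    labels.reverse()
    return labels

probs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.4, 0.9, 0.8, 0.1, 0.1]
thresholded = [int(p > 0.5) for p in probs]  # position-by-position cutoff
decoded = viterbi_binary(probs)              # whole-sequence decoding
```

With these inputs, the 0.5 cutoff keeps the isolated spike at position 2 and splits the bound region at position 7, whereas the decoded path labels positions 5 to 9 as one contiguous bound segment and drops the spike. In a trained CRF the transition scores would be learned from data rather than fixed by hand.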
Preprint submitted to Elsevier

Multiple community crowdsourcing challenges have been organized by the DREAM Consortium (http://dreamchallenges.org/about-dream/) to find the best computational methods for predicting TF binding sites in both in vitro and in vivo settings [3, 4]. These challenges also set the community standard for both processing data and evaluating methods. However, top-performing methods from these challenges have revealed key limitations in the current TF binding prediction community. Generalizing predictions beyond the training panels of cell types and TFs can potentially benefit from multitask learning and increased prediction resolution. However, many existing methods still use shallow single-task models. Predictions generated from these methods typically have low resolution, and they cannot achieve competitive performance for prediction of binding regions shorter than 50 base pairs (bp), although the actual TF binding sites are considerably shorter [5].

1.1. Related work

Early TF binding prediction methods such as MEME [6, 7] focused on deriving interpretable TF motif position weight matrices (PWMs) that characterize TF sequence specificity. Amid rapid advancement in machine learning, however, growing evidence has suggested that sequence specificity can be more accurately captured through more abstract feature extraction techniques. For example, a method called DeepBind [8] used a convolutional neural network to extract TF binding patterns from DNA sequences. Several modifications to DeepBind subsequently improved model architecture [9] as well as prediction resolution [10]. Yuan et al. developed BindSpace, which embeds TF-bound sequences into a common high-dimensional space. Embedding methods like BindSpace belong to a class of representation learning techniques commonly used in natural language processing [12] and genomics [13, 14] for mapping entities to vectors of real numbers. New methods also explicitly model protein binding sites with multiple binding mode predictors [15], and the effect of sequence variants on non-coding DNA functions at scale [16, 17, 18].

In general, the DNA sequence at a potential TF binding site is just the beginning of the full DNA-function picture, and the state of the surrounding chromosome, the TF and TF-cofactor expression, and other contextual factors play an equally large role. This multitude of factors changes substantially from cell type to cell type. In vivo TF binding site prediction, therefore, requires cell-type-specific data such as chromatin accessibility and histone modifications. Convolutional neural networks as well as TF- and cell-type-specific embedding vectors have both been used to learn cell-type-specific TF binding profiles from DNA sequences and DNase-seq data [19]. The DREAM Consortium also initiated the ENCODE-DREAM challenge to systematically evaluate methods for predicting in vivo TF binding sites [4]. Apart from carefully designed model architectures, top-ranking methods in this challenge also rely on extensively curated feature sets. One such method, called Catchitt [20], achieves top performance by leveraging a wide range of features including DNA sequences, genome annotations, TF motifs, DNase-seq, and RNA-seq.

1.2. Current limitations

Compendium databases such as ENCODE [21] and Remap [22] have collected ChIP-seq data for a large collection of TFs in a handful of well-studied cell types and organisms [1]. Within a single organism, however, the ENCODE TF ChIP-seq collection is highly skewed towards only a few TFs in a small collection of well-characterized cell lines and primary cell types (Figure S1). Knowledge transfer from well-known cell types and TFs is crucial for understanding less-studied cell types and TFs.

Top-performing methods from the ENCODE-DREAM Challenge typically adopt the single-task learning approach. For example, Catchitt [20] trains one model per TF and cell type. Cross cell-type transfer predictions are achieved by providing a trained model with input features from a new cell type. This approach can be highly unreliable as the chromatin landscapes between the trained and predicted cell types can be drastically different [23] and these differences can be functionally important [24]. Alternatively, each model can be trained on one TF using cell-type-specific data across multiple cell types of interest [25]. However, such models tend to assign high binding probabilities to common binding sites among training cell types without proper mechanisms to distinguish cell types. Very few methods have adopted the multitask learning approach in which data from multiple cell types and TFs are trained jointly in order to improve the overall model performance. One multitask solution [16, 17, 18] involves training a multiclass classifier on DNA sequences, where each class represents the occurrence of binding sites for one TF in one cell type. This solution is suboptimal as it cannot generalize predictions beyond the training TF and cell-type pairs.

Sequence context affects TF binding affinity [26], and increasing context size can improve TF binding site prediction [16]. TF binding sites are typically only 4-20 bp long [5]; increasing the prediction res-

To collect target labels that cover a wide range of cellular conditions and binding patterns, we first select 7 cell types (including 3 cancer cell types, 3 normal cell types, and 1 stem cell type) and 22 TFs (including 17 TFs from 7 TF protein families as well as 5 functionally related TFs). Conserved and relaxed peak sets are collected from 71 ENCODE replicated ChIP-seq experiments conducted on our cell types and TFs of interest.
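Section 1.2 notes that a multiclass classifier with one class per TF and cell-type pair cannot be queried for pairs it was not trained on. One way to picture the alternative described in the abstract, where TF and cell-type identities are supplied as embedding vectors, is the toy sketch below. It is an illustration under stated assumptions, not the NetTIME architecture: the embedding size `EMB_DIM`, the random untrained weights, and the per-base logistic scorer `predict_per_base` are all hypothetical; only the panel sizes (22 TFs, 7 cell types) come from the text.

```python
import math
import random

N_TF, N_CELL_TYPE = 22, 7  # panel sizes stated in the text
EMB_DIM = 4                # hypothetical embedding size

rng = random.Random(0)     # fixed seed so the sketch is reproducible

def random_vector(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# One vector per TF and per cell type (random here, i.e. untrained).
tf_emb = [random_vector(EMB_DIM) for _ in range(N_TF)]
ct_emb = [random_vector(EMB_DIM) for _ in range(N_CELL_TYPE)]

# A single weight vector shared by every (TF, cell type) condition.
weights = random_vector(4 + 2 * EMB_DIM)

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def predict_per_base(seq, tf_id, ct_id):
    """Return one score between 0 and 1 per base pair of `seq`."""
    context = tf_emb[tf_id] + ct_emb[ct_id]  # list concatenation
    scores = []
    for base in seq:
        # Each position sees its own base plus the TF and cell-type vectors.
        features = ONE_HOT[base] + context
        logit = sum(w * x for w, x in zip(weights, features))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores

seq = "ACGTTGCA"
scores_a = predict_per_base(seq, tf_id=0, ct_id=0)
scores_b = predict_per_base(seq, tf_id=0, ct_id=3)  # same TF, other cell type
n_conditions = N_TF * N_CELL_TYPE  # 154 conditions share one set of weights
```

Because the condition enters as an input rather than as an output class, any of the 154 TF and cell-type combinations in the panel can be queried with the same weights, and the output has one value per base pair. This toy scorer has no sequence context window, so each score depends only on the base and the condition; a realistic model would add layers that read the surrounding sequence and cell-type-specific features.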